

Page 1: final writeup - Dartmouth CollegeTitle: final_writeup Author: Alisha D'souza Created Date: 5/31/2012 1:37:24 AM

Automatic Pairing of Chromosomes

Alisha DSouza

Abstract

A karyogram is a visual depiction of chromosomes as a pairwise ordered arrangement. Chromosomes from 30 karyograms of the Lisbon-K1 dataset are automatically paired by a multiclass k-Nearest Neighbor classifier, and the performance is discussed.

Final Project Report CS 174


1. INTRODUCTION

A karyotype [1] is the set of characteristics that describe the chromosomes in a cell. An ordered depiction of the karyotype as an image in a standard format is called a karyogram: chromosomes are arranged in pairs by decreasing size and by centromere position. The study of karyograms is at the heart of cytogenetics, and these analyses contribute greatly to the study of chromosomal abnormalities and aberrations, genetic disorders, and taxonomic relationships.

In humans, somatic cells contain 46 chromosomes arranged in 23 pairs: 22 pairs of autosomes and one pair of sex chromosomes. To produce a karyogram, cells arrested at the metaphase stage of cell division are stained with a dye such as Giemsa [2] and imaged. The chromosomes then need to be arranged in pairs in order of decreasing size. As a result of staining, each chromosome shows a characteristic banding pattern that aids classification. This process of pairing and karyotyping is usually done manually and requires considerable expert time. Automating it is highly desirable and is an active field of research [3]. Figure 1 shows a karyogram from the Lisbon-K1 dataset, where chromosomes are arranged in the order of their class. Figure 2 shows a stained chromosome with a distinct banding pattern.

1.1 DATASET

The Lisbon-K1 (LK1) dataset [3, 15] was used for this project. It contains chromosomes from bone marrow cells of leukemia patients and was developed by the technicians of the Institute of Molecular Medicine of Lisbon. The dataset contains 200 karyograms (9200 chromosomes); a subset of 33 karyograms was used for this project. This dataset is of much lower quality than other, more widely used datasets. Figure 3 shows a comparison between the LK1 dataset and the Copenhagen dataset.

Fig.1 Karyogram 1 from Lisbon-K1 dataset. The chromosome class is indicated by the red numbering and the first pair is highlighted by a box.

Fig.2 Visible banding pattern (dark bands annotated)

1.2 RELATED WORK

Numerous classifier designs have been proposed in the literature for pairing based on classification, including hidden Markov models [5], template matching [6], neural networks and multilayer perceptrons [7]-[12], wavelet [13], fuzzy [6] and Bayes [9] classifiers. Classification success with these methods is usually in the range of 70% to 80% on high-quality datasets, which is much lower than the 99.70% accuracy achieved by a human expert [3]. Khmelinskii et al. propose an algorithm that pairs chromosomes directly without accurately classifying them, assisted by a rough classification performed with a Support Vector Machine (SVM) classifier [14].

2. METHOD

The chromosomes in each karyogram are ordered and arranged according to the class to which they belong (Figure 1). The adopted pairing methods use the distance between the feature vectors associated with each chromosome: the distances from a given chromosome to every chromosome in the training set are calculated, and the chromosome is assigned to the nearest class. The following steps describe the two methods adopted and tested for pairing and classification.
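The nearest-class rule described above can be illustrated with a minimal k-Nearest Neighbor sketch. It assumes plain Euclidean distance between fixed-length feature vectors; the function name `knn_classify`, the toy two-feature data, and the tie-breaking rule are hypothetical choices made for illustration and are not taken from the report.

```python
import numpy as np
from collections import Counter

def knn_classify(train_features, train_labels, query, k=3):
    """Assign `query` to the majority class among its k nearest training vectors."""
    train_features = np.asarray(train_features, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training chromosome.
    dists = np.linalg.norm(train_features - query, axis=1)
    # Indices of the k smallest distances (stable sort keeps ties deterministic).
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(train_labels[i] for i in nearest)
    best = max(votes.values())
    # Assumed tie-break: among the top-voted classes, pick the one with the closest member.
    for i in nearest:
        if votes[train_labels[i]] == best:
            return train_labels[i]

# Toy data: two hypothetical features (length, width) for three classes.
train_X = [[10.0, 2.0], [9.8, 2.1], [6.0, 1.5], [6.2, 1.4], [3.0, 1.0], [3.1, 1.1]]
train_y = [1, 1, 2, 2, 3, 3]
predicted = knn_classify(train_X, train_y, [6.1, 1.45], k=3)
print(predicted)
```

Under this scheme, two chromosomes of a karyogram that receive the same class label would be treated as a pair.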

2.1 FEATURE EXTRACTION

To build a metric for the distance between two chromosomes, features must first be extracted. Before this, the chromosome images are pre-processed and geometrically corrected so that their lateral boundaries are approximately parallel and the axis of symmetry runs parallel to them [4].


Fig.3 Comparison of chromosomes of low-quality LK1 dataset (left) from bone-marrow cells and chromosomes from high quality Copenhagen dataset (right) [16]

Fig. 4 Geometric Correction


The features considered can be grouped into size-based features (length, width, ratio of length to width, and area of the bounding box) and pattern-based features (band profile and mutual information).

❖ Band profile: the average intensity along each row of the corrected chromosome image.

❖ Mutual information: this feature is always measured for a pair of chromosomes and cannot be calculated for a single chromosome. The mutual information MI between a pair of chromosome images $I_A$ and $I_B$ is:

$$MI(I_A, I_B) = \sum_{a,b} p_{AB}(a,b) \log \left[ \frac{p_{AB}(a,b)}{p_A(a)\, p_B(b)} \right]$$

where $p_{AB}(a,b)$ is the joint histogram of the images $I_A$ and $I_B$, and $p_A(a)$ and $p_B(b)$ are the histograms of each image, respectively.
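The two pattern-based features can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: images are 2-D grayscale arrays of equal size, the joint histogram is estimated with `numpy.histogram2d`, and the bin count `bins=16`, the function names, and the random test image are hypothetical choices rather than values from the report.

```python
import numpy as np

def band_profile(img):
    """Average intensity along each row of a geometrically corrected image."""
    return np.asarray(img, dtype=float).mean(axis=1)

def mutual_information(img_a, img_b, bins=16):
    """Mutual information of two equally sized grayscale images via their joint histogram."""
    a = np.asarray(img_a, dtype=float).ravel()
    b = np.asarray(img_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("images must have the same number of pixels")
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()                 # joint distribution p_AB(a, b)
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal p_A(a), shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal p_B(b), shape (1, bins)
    nz = p_ab > 0                              # skip empty cells, where 0 * log(0) is taken as 0
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

# Hypothetical data: a random "image" compared with itself and with a shuffled copy.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(60, 20))
shuffled = rng.permutation(img.ravel()).reshape(img.shape)
mi_self = mutual_information(img, img)
mi_noise = mutual_information(img, shuffled)
profile = band_profile(img)   # one value per image row
```

An image compared with itself yields a much higher MI than the same pixels in shuffled order, which is the behavior that makes MI usable as a pattern-similarity feature for pairs.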

The following figure summarizes the above.


Fig. 5 Summary of feature extraction: geometric correction, band profile (normalized intensity versus row number), and mutual information.


2.2 CALCULATION OF DISTANCE BETWEEN CHROMOSOMES

Two approaches were adopted for calculating the distance between pairs of chromosomes: a weighted-distance approach and a Euclidean-distance approach. Both are discussed.

✤ Weighted-Distance Approach

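As proposed by [16], the overall distance between two chromosomes i and j, combining all L features, is the weighted sum

$$D(i, j; \mathbf{w}) = \sum_{k=1}^{L} w(k)\, d_k(i, j)$$

where $d_k(i, j)$ is the distance between chromosomes i and j with respect to the kth feature, w(k) is the weight associated with the kth feature, and $\mathbf{w}$ represents the weight vector.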
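As a worked illustration of the weighted distance, the sketch below combines per-feature distances with a weight vector. The per-feature convention used here (absolute difference for scalar features, Euclidean norm for vector features such as the band profile) is an assumption, and the function names, feature values, and weights are hypothetical.

```python
import numpy as np

def feature_distances(feat_i, feat_j):
    """Per-feature distances d_k(i, j).

    Assumed convention: absolute difference for scalars, Euclidean norm for
    equal-length vectors (for a scalar the norm reduces to the absolute value).
    """
    return np.array([
        np.linalg.norm(np.atleast_1d(np.asarray(fi, dtype=float))
                       - np.atleast_1d(np.asarray(fj, dtype=float)))
        for fi, fj in zip(feat_i, feat_j)
    ])

def weighted_distance(feat_i, feat_j, w):
    """D(i, j; w) = sum over k of w(k) * d_k(i, j)."""
    d = feature_distances(feat_i, feat_j)
    w = np.asarray(w, dtype=float)
    if w.shape != d.shape:
        raise ValueError("one weight per feature is required")
    return float(w @ d)

# Hypothetical chromosomes described by (length, width, band profile).
chrom_i = (10.0, 2.0, [0.1, 0.5, 0.9])
chrom_j = (9.0, 2.5, [0.1, 0.5, 0.5])
weights = [0.5, 0.25, 0.25]
dist = weighted_distance(chrom_i, chrom_j, weights)
# Per-feature distances are 1.0, 0.5 and 0.4, so D = 0.5*1.0 + 0.25*0.5 + 0.25*0.4 = 0.725.
```

A larger weight makes the corresponding feature count more heavily toward the overall distance, which is why the weights are learned per class during training.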

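The weights w are obtained during the training step by a constrained optimization, under the constraint $\|\mathbf{w}\| = 1$, of the following objective:

$$E(\mathbf{w}_i) = \sum_{(a,b) \in V(i)} D(a, b; \mathbf{w}_i) - \sum_{(a,b) \in U(i)} D(a, b; \mathbf{w}_i)$$

where V(i) is the set of pairs of chromosomes of the ith class and U(i) is the set of pairs in which at most one chromosome belongs to the ith class. Each $\mathbf{w}_i$ is therefore computed by minimizing the sum of intraclass distances while maximizing the sum of interclass distances. This constrained optimization problem is approached using the method of Lagrange multipliers, and the cost function for class r is then

$$E(\mathbf{w}_r) = \Phi_r \mathbf{w}_r + \gamma\, \mathbf{w}_r^T \mathbf{w}_r$$

where $\gamma$ is the Lagrange multiplier and $\Phi_r = \mathbf{1}^T \Theta_r - \mathbf{1}^T \bar{\Theta}_r$, with $\mathbf{1}$ a column vector of ones and $\Theta_r$, $\bar{\Theta}_r$ the matrices whose rows hold the per-feature distances of the intraclass and interclass training pairs, respectively.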

Here each element di(k) is the distance between the ith pair of chromosomes from training set associated with the kth feature such that all pairs belong to class r. thus represents



The proposed pairing algorithm is based on a supervisedclassifier, previously trained with manually paired imagesprovided by experts. During the training step, a set of vectorweights wr with 1 ≤ r ≤ N = 22 are estimated by using allpossible pairs of the training set.

The distance between two chromosomes is assumed to bethe smallest one among all weight vectors wr,

D(i, j) = minr∈{1,...,22}

D(i, j;wr) (5)

The vectors wr, obtained during the training step, are com-puted by minimizing an energy function under the constraint‖w‖ = 1,

wr = arg minw:‖w‖=1

E(w). (6)

where

E(wi) =∑

(a,b)∈V (i)

D(a, b;wi)

︸ ︷︷ ︸

intraclass distance

−∑

(a,b)∈U(i)

D(a, b;wi)

︸ ︷︷ ︸

interclass distance

(7)

where V (i) is the set of all pairs of chromosomes of theith class and U(i) is the set of all chromosomes where atmost one chromosome in each pair belongs to the ith class.Each weight vector wr is computed by minimizing the sum ofintraclass distances (between chromosomes of the same class)and maximizing the sum of interclass distances (betweenchromosomes where at most one of them belongs to that class).

Let us consider the following matrix where each element,di(k), is the distance associated with the kth feature of theith pair of chromosomes in the karyogram,

Θr =

d1(1) d1(2) d1(3) ... d1(L)d2(1) d2(2) d2(3) ... d2(L)d3(1) d3(2) d3(3) ... d3(L)

... ... ... ... ...dR(1) dR(2) dR(3) ... dR(L)

. (8)

Θr is a R × L matrix where L is the number of featuresused in the pairing process and R the number of different pairsof chromosomes in the training set from class r. Let us alsoconsider the matrix Θr with the same structure of Θr butnow involving all pairs of the training set where at most onechromosome in each pair belongs to the rth class.

By using the Lagrange method the energy function may bewritten as follows

E(wr) = Φrwr + γwTr wr (9)

where Φr = 1T Θr − 1T Θr is a line vector with length L, 1

is a column vector of ones and γ is the Lagrange multiplier.The minimizer of E(wr) is

wr = ΦTr /

ΦrΦTr = vers(Φr) (10)

where vers(Φr) is the unit length vector aligned with Φr.The equation (10) is used in the training step to compute

the set of vectors wr, with 1 ≤ r ≤ 22, which are then used inturn to compute the distance between two chromosomes usingthe expression (5).

The distances computed using expression (5) form a sym-metric matrix of distances D, where each element, D(i, j), isthe distance between the i-th and the j-th chromosomes,

D =

D(1, 1) D(1, 2) D(1, 3) ... D(1, 22)D(2, 1) D(2, 2) D(2, 3) ... D(2, 22)D(3, 1) D(3, 2) D(3, 3) ... D(3, 22)

... ... ... ... ...D(22, 1) D(22, 2) D(22, 3) ... D(22, 22)

.

III. CLASSIFIER

The pairing process is a computationally hard problembecause the optimal pairing must minimize the overall dis-tance, that is, the solution is the global minimum of thecost function. This problem can be stated as a combinatorialoptimization problem. Moreover, it can be formulated as aninteger programming problem, thus allowing for very efficientoptimization methods. To do so, the cost function, as well asthe constraints, have to be expressed by linear functions of thevariables.

Considering n chromosomes (for n even), a pairing assign-ment P is defined as a set of ordered pairs (i, j), such that(a) i %= j holds for any pair and (b) any given index i appearsin no more than one pair of the set. A pairing assignment issaid to be total if and only if, for any i = 1, . . . , n, thereis exactly one pair (r, s) in the set such that either i = r ori = s. The sum of distances implied by a pairing P can bewritten as

C(P) =∑

(i,j)∈P

D(i, j), (11)

and the goal of the pairing process is to find a total pairing Pthat minimizes C(P).

Note that the cost function (11) can be reformulated as amatrix inner product between the distance matrix D and apairing matrix X = {x(i, j)}, where

x(i, j) =

{

1 (i, j) ∈ P or (j, i) ∈ P0 otherwise

(12)

Thus, (11) can be re-written as C(P) = 12 D · X where ‘·’

denotes the usual matrix inner product, defined as follows

D · X :=∑

i

j

D(i, j)x(i, j) (13)

The cost function becomes then linear with the pairing matrixX. The entries of this matrix are the parameters with respectto which (13) is to be minimized.

In order for the matrix X to represent a valid total pairing,this matrix has to satisfy constraints (a) and (b) above, which

Authorized licensed use limited to: UNIVERSIDADE TECNICA DE LISBOA. Downloaded on March 17,2010 at 12:32:31 EDT from IEEE Xplore. Restrictions apply.

s.t. $\lVert w_i \rVert = 1$

$\Phi_r = \mathbf{1}^T \Theta_r - \mathbf{1}^T \bar{\Theta}_r$


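intraclass distances. $\bar{\Theta}_r$ has a similar structure but involves all pairs from the training set that contain at most one chromosome of class r. $w_r$ has a closed-form solution: it is the unit vector along the direction of $\Phi_r$ [16],

$$\Phi_r = \mathbf{1}^T \Theta_r - \mathbf{1}^T \bar{\Theta}_r, \qquad r \in [1, 22].$$

Once we have all the $w_r$, the distance between chromosomes i and j is

$$D(i, j) = \min_{r \in \{1, \ldots, 22\}} \sum_{k=1}^{L} w_r(k)\, d_k(i, j).$$

✤ Euclidean-Distance Approach

Here the distance between chromosomes i and j is

$$D(i, j) = \sum_{k=1}^{L} \lVert d_k(i, j) \rVert^2,$$

where $d_k(i, j)$ is the distance between the said pair with respect to the kth feature.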
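The two distance definitions can be illustrated with a minimal sketch. It assumes the per-feature distances of one chromosome pair are already available as a list `d` of length L and that the trained weight vectors are given as a list `weights` with one entry per class; the function names and the toy numbers are hypothetical and are not taken from the project code.

```python
def weighted_distance(d, weights):
    """Weighted-distance approach: smallest weighted sum of the
    per-feature distances d over all class weight vectors w_r."""
    return min(
        sum(w_k * d_k for w_k, d_k in zip(w_r, d))
        for w_r in weights
    )


def euclidean_distance(d):
    """Euclidean-distance approach: sum of squared per-feature distances."""
    return sum(d_k ** 2 for d_k in d)


# Toy example with L = 3 features and two unit-length weight vectors
# (the report uses 22 classes; two are enough to show the min over r).
d = [0.5, 1.0, 2.0]
weights = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]

print(weighted_distance(d, weights))  # 0.5 (the first weight vector gives the minimum)
print(euclidean_distance(d))          # 5.25
```

In this sketch the weighted distance depends on the trained vectors, whereas the Euclidean distance treats all features equally.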

2.3 k-NEAREST NEIGHBOR CLASSIFICATION

For a given test chromosome i, the distances $D(i, j)$ to every chromosome j in the training set are calculated. The chromosome is assigned to the class that holds the majority among its k nearest neighbors. Chromosomes of the test karyogram are paired in this way.
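The classification step can be sketched as follows, assuming the distances from one test chromosome to all training chromosomes have already been computed. The function name `knn_classify` and the tie-breaking rule (prefer the class of the closest neighbor among the tied classes) are assumptions made for this illustration; the report does not state how ties are resolved.

```python
from collections import Counter


def knn_classify(distances, labels, k=1):
    """Assign a class by majority vote among the k nearest training samples.

    distances[j] is the distance from the test chromosome to training
    chromosome j, and labels[j] is the class of training chromosome j.
    """
    # Indices of the k training chromosomes closest to the test chromosome.
    nearest = sorted(range(len(distances)), key=lambda j: distances[j])[:k]
    votes = Counter(labels[j] for j in nearest)
    top = max(votes.values())
    # Assumed tie-break: walk the neighbors from closest to farthest and
    # return the first class that reached the top vote count.
    for j in nearest:
        if votes[labels[j]] == top:
            return labels[j]


distances = [0.1, 0.2, 0.3]
labels = [4, 9, 9]
print(knn_classify(distances, labels, k=1))  # 4
print(knn_classify(distances, labels, k=3))  # 9
```

With k = 1 the closest training chromosome decides the class alone, while with larger k a single close outlier can be outvoted.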

3. RESULTS

The average classification accuracy is 38.41% with the weighted-distance approach and 52.65% with the Euclidean-distance approach for k = 1. Figure 6 shows how the error changes with k.

[Figure 6: error (%) plotted against k, for k from 0 to 25, with one curve for the Euclidean distance and one for the weighted distance.]

Fig. 6 Error versus k

4. DISCUSSION

The pairing problem is approached as a multinomial classification problem with 22 classes. Greater accuracy is obtained with the simpler Euclidean-distance based k-NN classifier. This may be explained by the following:

✤ The optimization of weights in the weighted-distance based classifier requires that band profiles be of the same length for all pairs of chromosomes, and the cross-correlation between each pair is not necessarily maximized.


✤ The Euclidean-distance based classifier does not require such a constraint, so the relative position of the band profiles for a pair of chromosomes can be adjusted so that the cross-correlation is maximized.

The accuracy of 52.65% is acceptable and is comparable to the accuracy of < 50.50% achieved with a Nearest Neighbor classifier on the LK1 dataset, as reported by Khmelinskii et al. [16].

5. CONCLUSIONS

Although the classification accuracy is poor, the results are promising. The observed trends are as expected. Use of a larger dataset will enable a better understanding.

6. REFERENCES

[1] http://en.wikipedia.org/wiki/Karyotype#cite_note-3

[2] http://en.wikipedia.org/wiki/Giemsa_stain

[3] A. Khmelinskii, R. Ventura and J. Sanches, “Chromosome Pairing for Karyotyping Purposes using Mutual Information,” Proceedings of the 5th IEEE International Symposium on Biomedical Imaging, 14-17 May 2008, pp 484-487.

[4] S. Khan, A. DSouza, J. Sanches and R. Ventura, “Geometric Correction of Deformed Chromosomes for Automatic Karyotyping” (accepted for publication).

[5] J. M. Conroy, R. L. Jr. Becker, W. Lefkowitz, K. L. Christopher, R. B. Surana, T. O’Leary, D. P. O’Leary, and T. G. Kolda, “Hidden Markov models for chromosome identification,” in Proceedings of the 14th IEEE Symposium of Computer-Based Medical Systems, July 2001, pp. 473–477.

[6] A. M. Badawi, K. G. Hasan, E. A. Aly, and R. A. Messiha, “Chromosomes classification based on neural networks, fuzzy rule based, and template matching classifiers,” in Proceedings of the 46th IEEE International Midwest Symposium on Circuits and Systems, Dec. 2003, vol. 1, pp. 383–387.

[7] J. R. Stanley, M. J. Keller, P. Gader, and W. C. Caldwell, “Data-driven homologue matching for chromosome identification,” IEEE Transactions on Medical Imaging, vol. 17, no. 3, pp. 451–462, 1998.

[8] M. Zardoshti-Kermani and A. Afshordi, “Classification of chromosomes using higher-order neural networks,” in IEEE International Conference on Neural Networks, Nov.-Dec. 1995, vol. 5, pp. 2587–2591.

[9] B. Lerner, H. Guterman, I. Dinstein, and Y. Romem, “A comparison of multilayer perceptron neural network and Bayes piecewise classifier for chromosome classification,” IEEE World Congress on Neural Networks, IEEE International Conference on Computational Intelligence, vol. 6, pp. 3472–3477, June-July 1994.

[10] B. Lerner, M. Levinstein, B. Rosenberg, H. Guterman, L. Dinstein, and Y. Romem, “Feature selection and chromosome classification using a multilayer perceptron neural network,” IEEE World Congress on Computational Intelligence, IEEE International Conference on Neural Networks, vol. 6, pp. 3540–3545, Jun.-Jul. 1994.

[11] B. Lerner, “Toward a completely automatic neural-network-based human chromosome analysis,” IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 28, no. 4, pp. 544–552, Aug. 1998.

[12] J. M. Cho, “Chromosome classification using backpropagation neural networks,” IEEE Engineering in Medicine and Biology Magazine, vol. 19, no. 1, pp. 28–33, Jan.-Feb. 2000.

[13] Q. Wu and K. R. Castleman, “Automated chromosome classification using wavelet-based band pattern descriptors,” in 13th IEEE Symposium on Computer-Based Medical Systems, June 2000, pp. 189–194.

[14] A. Khmelinskii, R. Ventura, and J. Sanches, “Classifier-assisted metric for chromosome pairing,” IEEE International Conference of Engineering in Medicine and Biology Society, 2010, pp. 6729–6732.

[15] http://mediawiki.isr.ist.utl.pt/wiki/Lisbon-K_Chromosome_Dataset

[16] A. Khmelinskii, R. Ventura, and J. Sanches, “A Novel Metric for Bone Marrow Cells Chromosome Pairing,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 6, pp. 1420–1429, June 2010.