An Introduction to Deterministic Annealing
Fabrice Rossi
SAMM – Université Paris 1
2012
Optimization
- in machine learning:
  - optimization is one of the central tools
  - methodology:
    - choose a model with some adjustable parameters
    - choose a goodness-of-fit measure of the model on some data
    - tune the parameters in order to maximize the goodness of fit
  - examples: artificial neural networks, support vector machines, etc.
- in other fields:
  - operational research: project planning, routing, scheduling, etc.
  - design: antennas, wings, engines, etc.
  - etc.
Easy or Hard?
- is optimization computationally difficult?
- the convex case is relatively easy:
  - min_{x∈C} J(x) with C convex and J convex
  - polynomial algorithms (in general)
  - why?
    - a local minimum is global
    - the local tangent hyperplane is a global lower bound
- the non-convex case is hard:
  - multiple minima
  - no local-to-global inference
  - NP-hard in some cases
In machine learning
- convex case:
  - linear models with(out) regularization (ridge, lasso)
  - kernel machines (SVMs and friends)
  - nonlinear projections (e.g., semidefinite embedding)
- non-convex case:
  - artificial neural networks (such as multilayer perceptrons)
  - vector quantization (a.k.a. prototype-based clustering)
  - general clustering
  - optimization with respect to meta-parameters:
    - kernel parameters
    - discrete parameters, e.g., feature selection
- in this lecture, non-convex problems:
  - combinatorial optimization
  - mixed optimization
Particular cases
- Pure combinatorial optimization problems:
  - a solution space 𝓜: finite but (very) large
  - an error measure E from 𝓜 to ℝ
  - goal: solve

      M* = arg min_{M∈𝓜} E(M)

  - example: graph clustering
- Mixed problems:
  - a solution space 𝓜 × ℝ^p
  - an error measure E from 𝓜 × ℝ^p to ℝ
  - goal: solve

      (M*, W*) = arg min_{M∈𝓜, W∈ℝ^p} E(M, W)

  - example: clustering in ℝ^n
Graph clustering
Goal: find an optimal clustering of a graph with N nodes into K clusters

- 𝓜: set of all partitions of {1, …, N} into K classes (the asymptotic behavior of |𝓜| is roughly K^{N−1} for a fixed K with N → ∞)
- assignment matrix: a partition in 𝓜 is described by an N × K matrix M such that M_ik ∈ {0, 1} and ∑_{k=1}^K M_ik = 1
- many error measures are available:
  - graph cut measures (node normalized, edge normalized, etc.)
  - modularity
  - etc.
![Page 12: An Introduction to Deterministic Annealingapiacoa.org › ... › deterministic-annealing › da-slides.pdf · 2020-03-30 · I in this presentation:deterministic annealing. Outline](https://reader035.vdocuments.us/reader035/viewer/2022062414/5f0d094a7e708231d4385c8b/html5/thumbnails/12.jpg)
Graph clustering

- original graph
- four clusters

[figure: an example graph with 34 numbered nodes, shown partitioned into four clusters]
Graph visualization
[figure legend: woman / bisexual man / heterosexual man]

- 2386 persons: unreadable
- maximal-modularity clustering into 39 clusters
- hierarchical display
Vector quantization
- N observations (x_i)_{1≤i≤N} in ℝ^n
- 𝓜: set of all partitions
- quantization error:

    E(M, W) = ∑_{i=1}^N ∑_{k=1}^K M_ik ‖x_i − w_k‖²

- the "continuous" parameters are the prototypes w_k ∈ ℝ^n
- the assignment matrix notation is equivalent to the standard formulation

    E(M, W) = ∑_{i=1}^N ‖x_i − w_{k(i)}‖²,

  where k(i) is the index of the cluster to which x_i is assigned
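The slides give no code, but the equivalence of the two formulations is easy to check numerically. An illustrative NumPy sketch (the data, prototypes, and random assignment below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, K = 10, 2, 3
X = rng.normal(size=(N, n))           # observations x_i in R^n
W = rng.normal(size=(K, n))           # prototypes w_k
labels = rng.integers(0, K, size=N)   # k(i): index of the cluster of x_i
M = np.zeros((N, K))
M[np.arange(N), labels] = 1           # assignment matrix

# matrix form: E(M, W) = sum_i sum_k M_ik ||x_i - w_k||^2
sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
E_matrix = (M * sq_dists).sum()

# standard form: E(M, W) = sum_i ||x_i - w_{k(i)}||^2
E_standard = ((X - W[labels]) ** 2).sum()

assert np.isclose(E_matrix, E_standard)
```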
Example

[figure: scatter plot of the example data set, X1 (from −2 to 6) versus X2 (from −2 to 4)]
Combinatorial optimization
- a research field by itself
- many problems are NP-hard... and one therefore relies on heuristics or specialized approaches:
  - relaxation methods
  - branch and bound methods
  - stochastic approaches:
    - simulated annealing
    - genetic algorithms
- in this presentation: deterministic annealing
Outline
Introduction
Mixed problems
  Soft minimum
  Computing the soft minimum
  Evolution of β

Deterministic Annealing
  Annealing
  Maximum entropy
  Phase transitions
  Mass constrained deterministic annealing

Combinatorial problems
  Expectation approximations
  Mean field annealing
  In practice
Optimization strategies
- mixed problem transformation:

    (M*, W*) = arg min_{M∈𝓜, W∈ℝ^p} E(M, W)

- remove the continuous part:

    min_{M∈𝓜, W∈ℝ^p} E(M, W) = min_{M∈𝓜} ( min_{W∈ℝ^p} E(M, W) )

- or the combinatorial part:

    min_{M∈𝓜, W∈ℝ^p} E(M, W) = min_{W∈ℝ^p} ( min_{M∈𝓜} E(M, W) )

- this is not alternate optimization
Alternate optimization
- elementary heuristic:
  1. start with a random configuration M^0 ∈ 𝓜
  2. compute W^i = arg min_{W∈ℝ^p} E(M^{i−1}, W)
  3. compute M^i = arg min_{M∈𝓜} E(M, W^i)
  4. go back to 2 until convergence
- e.g., the k-means algorithm for vector quantization:
  1. start with a random partition M^0 ∈ 𝓜
  2. compute the optimal prototypes with

       W^i_k = (1 / ∑_{j=1}^N M^{i−1}_{jk}) ∑_{j=1}^N M^{i−1}_{jk} x_j

  3. compute the optimal partition with M^i_{jk} = 1 if and only if k = arg min_{1≤l≤K} ‖x_j − W^i_l‖²
  4. go back to 2 until convergence
- alternate optimization converges to a local minimum
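As an illustration, the k-means loop above can be sketched in NumPy (the initialization choice and the convergence test are ours, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate optimization for vector quantization: optimal prototypes
    for a fixed partition, then optimal partition for fixed prototypes,
    repeated until the assignment stops changing."""
    rng = np.random.default_rng(seed)
    # start from K distinct data points as prototypes (a common choice)
    W = X[rng.choice(len(X), size=K, replace=False)]
    labels = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    for _ in range(n_iter):
        # step 2: optimal prototypes are the cluster means
        W = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else W[k]
                      for k in range(K)])
        # step 3: optimal partition assigns each point to its nearest prototype
        new_labels = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        if (new_labels == labels).all():
            break
        labels = new_labels
    return labels, W

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-3, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels, W = kmeans(X, K=2)
# the two well separated groups end up in two different clusters
assert (labels[:20] == labels[0]).all() and (labels[20:] == labels[20]).all()
assert labels[0] != labels[20]
```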
Combinatorial first
- let us consider the combinatorial-first approach:

    min_{M∈𝓜, W∈ℝ^p} E(M, W) = min_{W∈ℝ^p} ( min_{M∈𝓜} E(M, W) )

- not attractive a priori, as W ↦ min_{M∈𝓜} E(M, W):
  - has no reason to be convex
  - has no reason to be C¹
- vector quantization example:

    F(W) = ∑_{i=1}^N min_{1≤k≤K} ‖x_i − w_k‖²

  is neither convex nor C¹
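A small numerical check that the inner combinatorial minimum of the vector quantization cost reduces to a per-observation minimum over prototypes; the brute-force scan over all K^N assignments is only feasible because the example is tiny:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1))   # tiny example so the brute-force scan is feasible
K = 2

def F(W):
    """F(W) = sum_i min_{1<=k<=K} ||x_i - w_k||^2."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def F_brute(W):
    """Same value obtained by scanning all K^N assignments."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return min(sum(d[i, a[i]] for i in range(len(X)))
               for a in product(range(K), repeat=len(X)))

W = rng.normal(size=(K, 1))
assert np.isclose(F(W), F_brute(W))
```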
Example

Clustering elements of ℝ into 2 clusters (back to the earlier example).

[figure: the data points on the real line]

[figure: contour plot of ln F(W) over (w1, w2) ∈ [−4, 4]²; multiple local minima and singularities: minimizing F(W) is difficult]
Soft minimum
- a soft minimum approximation of min_{M∈𝓜} E(M, W),

    F_β(W) = −(1/β) ln ∑_{M∈𝓜} exp(−β E(M, W)),

  solves the regularity issue: if E(M, ·) is C¹, then F_β(W) is also C¹
- if M*(W) is such that E(M, W) > E(M*(W), W) for all M ≠ M*(W), then

    lim_{β→∞} F_β(W) = E(M*(W), W) = min_{M∈𝓜} E(M, W)
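The soft minimum is a log-sum-exp and can be computed stably by shifting by the true minimum. A sketch over a finite list of energies standing in for {E(M, W)}_{M∈𝓜} (the helper name `soft_min` is ours):

```python
import numpy as np

def soft_min(values, beta):
    """F_beta = -(1/beta) ln sum_M exp(-beta E_M), computed stably by
    factoring out the true minimum (log-sum-exp trick)."""
    v = np.asarray(values, dtype=float)
    m = v.min()
    return m - np.log(np.exp(-beta * (v - m)).sum()) / beta

E = np.array([3.0, 1.0, 4.0, 1.5])
# sandwich bound: -(1/beta) ln|M| <= F_beta - min <= 0, for every beta > 0
for beta in [0.1, 1.0, 10.0, 100.0]:
    f = soft_min(E, beta)
    assert -np.log(len(E)) / beta - 1e-12 <= f - E.min() <= 1e-12
# and F_beta converges to the hard minimum as beta grows
assert abs(soft_min(E, 1000.0) - E.min()) < 1e-6
```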
Example

[figures: E(M, W) plotted over 𝓜, then exp(−β E(M, W)) compared with exp(−β E(M*, W)) for β = 10^{−2}, 10^{−1.5}, …, 10^2; as β increases, the weight concentrates on the optimal assignment M*. A final plot shows F_β as a function of β over [10^{−2}, 10^2].]
Discussion
- positive aspects:
  - if E(M, W) is C¹ with respect to W for all M, then F_β is C¹
  - lim_{β→∞} F_β(W) = min_{M∈𝓜} E(M, W)
  - ∀β > 0,

      −(1/β) ln |𝓜| ≤ F_β(W) − min_{M∈𝓜} E(M, W) ≤ 0

  - when β is close to 0, F_β is generally easy to minimize
- negative aspects:
  - how to compute F_β(W) efficiently?
  - how to solve min_{W∈ℝ^p} F_β(W)?
  - why would that work in practice?
- philosophy: where does F_β(W) come from?
![Page 49: An Introduction to Deterministic Annealingapiacoa.org › ... › deterministic-annealing › da-slides.pdf · 2020-03-30 · I in this presentation:deterministic annealing. Outline](https://reader035.vdocuments.us/reader035/viewer/2022062414/5f0d094a7e708231d4385c8b/html5/thumbnails/49.jpg)
Discussion
I positive aspects:I if E(M,W ) is C1 with respect to W for all M, then Fβ is C1
I limβ→∞ Fβ(W ) = minM∈M E(M,W )I ∀β > 0,
−1β
ln |M| ≤ Fβ(W )− minM∈M
E(M,W ) ≤ 0
I when β is close to 0, Fβ is generally easy to minimizeI negative aspects:
I how to compute Fβ(W ) efficiently?I how to solve minW∈Rp Fβ(W )?I why would that work in practice?
I philosophy: where does Fβ(W ) come from?
![Page 50: An Introduction to Deterministic Annealingapiacoa.org › ... › deterministic-annealing › da-slides.pdf · 2020-03-30 · I in this presentation:deterministic annealing. Outline](https://reader035.vdocuments.us/reader035/viewer/2022062414/5f0d094a7e708231d4385c8b/html5/thumbnails/50.jpg)
Discussion
I positive aspects:I if E(M,W ) is C1 with respect to W for all M, then Fβ is C1
I limβ→∞ Fβ(W ) = minM∈M E(M,W )I ∀β > 0,
−1β
ln |M| ≤ Fβ(W )− minM∈M
E(M,W ) ≤ 0
I when β is close to 0, Fβ is generally easy to minimizeI negative aspects:
I how to compute Fβ(W ) efficiently?I how to solve minW∈Rp Fβ(W )?I why would that work in practice?
I philosophy: where does Fβ(W ) come from?
![Page 51: An Introduction to Deterministic Annealingapiacoa.org › ... › deterministic-annealing › da-slides.pdf · 2020-03-30 · I in this presentation:deterministic annealing. Outline](https://reader035.vdocuments.us/reader035/viewer/2022062414/5f0d094a7e708231d4385c8b/html5/thumbnails/51.jpg)
Outline
Introduction
Mixed problems
  Soft minimum
  Computing the soft minimum
  Evolution of β

Deterministic Annealing
  Annealing
  Maximum entropy
  Phase transitions
  Mass constrained deterministic annealing

Combinatorial problems
  Expectation approximations
  Mean field annealing
  In practice
Computing Fβ(W )
- in general, computing F_β(W) is intractable: it requires an exhaustive evaluation of E(M, W) on the whole set 𝓜
- a particular case: clustering with an additive cost, i.e.,

    E(M, W) = ∑_{i=1}^N ∑_{k=1}^K M_ik e_ik(W),

  where M is an assignment matrix (and, e.g., e_ik(W) = ‖x_i − w_k‖²)
- then

    F_β(W) = −(1/β) ∑_{i=1}^N ln ∑_{k=1}^K exp(−β e_ik(W))

  with computational cost O(NK) (compared to ≃ K^{N−1})
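For the additive case, the O(NK) formula can be validated against the brute-force sum over all K^N assignment matrices on a tiny instance (the e_ik values below are random stand-ins for ‖x_i − w_k‖²):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, K, beta = 5, 3, 2.0
e = rng.uniform(0, 2, size=(N, K))   # random stand-in for e_ik(W)

# brute force: scan all K^N assignments (one cluster index per observation)
Z = 0.0
for a in product(range(K), repeat=N):
    E = sum(e[i, a[i]] for i in range(N))
    Z += np.exp(-beta * E)
F_brute = -np.log(Z) / beta

# factorized O(NK) formula from the slide
F_fast = -np.log(np.exp(-beta * e).sum(axis=1)).sum() / beta

assert np.isclose(F_brute, F_fast)
```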
Proof sketch
I let Zβ(W ) be the partition function given by
Zβ(W ) =∑
M∈M
exp(−βE(M,W ))
I assignments inM are independent and the sum can berewritten
Zβ(W ) =∑
M1.∈CK
. . .∑
MN.∈CK
exp(−βE(M,W )),
with CK = {(1,0, . . . ,0), . . . , (0, . . . ,0,1)}I then
Zβ(W ) =N∏
i=1
∑Mi.∈CK
exp(−β∑
k
Mik eik (W ))
Independence
- an additive cost corresponds to some form of conditional independence
- given the prototypes, observations are assigned independently to their optimal clusters
- in other words: the globally optimal assignment is the concatenation of the individual optimal assignments
- this is what makes alternate optimization tractable in the k-means despite its combinatorial aspect:

    min_{M∈𝓜} ∑_{i=1}^N ∑_{k=1}^K M_ik ‖x_i − w_k‖²
Minimizing Fβ(W )
- assume E to be C¹ with respect to W; then

    ∇F_β(W) = ( ∑_{M∈𝓜} exp(−β E(M, W)) ∇_W E(M, W) ) / ( ∑_{M∈𝓜} exp(−β E(M, W)) )

- at a minimum, ∇F_β(W) = 0 ⇒ solve the equation or use gradient descent
- same calculation problem as for F_β(W): in general, an exhaustive scan of 𝓜 is needed
- if E is additive,

    ∇F_β(W) = ∑_{i=1}^N ( ∑_{k=1}^K exp(−β e_ik(W)) ∇_W e_ik(W) ) / ( ∑_{l=1}^K exp(−β e_il(W)) )
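For vector quantization (e_ik(W) = ‖x_i − w_k‖², so ∇_{w_k} e_ik = 2(w_k − x_i)), the additive-case gradient formula can be checked against finite differences; a sketch with arbitrary data and β of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # small data set
K, beta = 2, 0.7

def F_beta(w_flat):
    W = w_flat.reshape(K, 2)
    e = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # e_ik(W)
    return -np.log(np.exp(-beta * e).sum(axis=1)).sum() / beta

def grad_F_beta(w_flat):
    W = w_flat.reshape(K, 2)
    diff = W[None, :, :] - X[:, None, :]      # grad_{w_k} e_ik = 2 (w_k - x_i)
    e = (diff ** 2).sum(axis=2)
    mu = np.exp(-beta * e)
    mu /= mu.sum(axis=1, keepdims=True)       # softmax weights mu_ik
    return (2 * mu[:, :, None] * diff).sum(axis=0).ravel()

w0 = rng.normal(size=K * 2)
eps = 1e-6
num = np.zeros_like(w0)
for j in range(w0.size):
    d = np.zeros_like(w0)
    d[j] = eps
    num[j] = (F_beta(w0 + d) - F_beta(w0 - d)) / (2 * eps)  # central difference

assert np.allclose(num, grad_F_beta(w0), atol=1e-5)
```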
Fixed point scheme
I a simple strategy to solve ∇Fβ(W) = 0
I starting from a random value of W:
1. compute (expectation phase)
$$\mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta e_{il}(W))}$$
2. keeping the µik constant, solve for W (maximization phase)
$$\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik}\,\nabla_W e_{ik}(W) = 0$$
3. loop to 1 until convergence
I generally converges to a minimum of Fβ(W ) (for a fixed β)
Vector quantization
I if eik(W) = ‖xi − wk‖² (vector quantization),
$$\nabla_{w_k} F_\beta(W) = 2 \sum_{i=1}^{N} \mu_{ik}(W)\,(w_k - x_i),$$
with
$$\mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \exp(-\beta \|x_i - w_l\|^2)}$$
I setting ∇Fβ(W ) = 0 leads to
$$w_k = \frac{1}{\sum_{i=1}^{N} \mu_{ik}(W)} \sum_{i=1}^{N} \mu_{ik}(W)\, x_i$$
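The two phases above are easy to sketch in NumPy (an illustrative sketch, not code from the slides; the names `soft_memberships` and `fixed_point_step` are made up for the example):

```python
import numpy as np

def soft_memberships(X, W, beta):
    """mu_ik = exp(-beta * ||x_i - w_k||^2) / sum_l exp(-beta * ||x_i - w_l||^2)."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    mu = np.exp(logits)
    return mu / mu.sum(axis=1, keepdims=True)

def fixed_point_step(X, W, beta):
    """One EM-like step: memberships, then prototypes as weighted means."""
    mu = soft_memberships(X, W, beta)               # expectation phase
    W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]    # maximization phase
    return W_new, mu
```

Iterating `fixed_point_step` at a fixed β is the EM-like scheme of the previous slides; for β → 0 the memberships tend to 1/K and the prototypes to the center of mass.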
Links with the K-means
I if µil = δ_{k(i)=l}, we obtain the prototype update rule of the K-means
I in the K-means, we have
$$k(i) = \arg\min_{1\le l\le K} \|x_i - w_l\|^2$$
I in deterministic annealing, we have
$$\mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^{K} \exp(-\beta \|x_i - w_l\|^2)}$$
I this is a soft minimum version of the crisp rule of the K-means
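A quick numerical check of this convergence of the soft rule to the crisp one (illustrative, with made-up values):

```python
import numpy as np

# one observation at x = 1 and two prototypes: the crisp K-means rule picks
# the closest prototype; the soft rule recovers it as beta grows
x = np.array([1.0])
W = np.array([0.0, 3.0])
d2 = (x - W) ** 2  # squared distances to each prototype

for beta in (0.01, 1.0, 10.0):
    logits = -beta * d2
    mu = np.exp(logits - logits.max())
    mu /= mu.sum()
    print(beta, mu)  # memberships drift from (0.5, 0.5) toward (1, 0)
```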
Links with EM
I isotropic Gaussian mixture with a unique and fixed variance ε
$$p_k(x\,|\,w_k) = \frac{1}{(2\pi\varepsilon)^{p/2}}\, e^{-\frac{1}{2\varepsilon}\|x - w_k\|^2}$$
I given the mixing coefficients δk , responsibilities are
$$P(x_i \in C_k\,|\,x_i, W) = \frac{\delta_k\, e^{-\frac{1}{2\varepsilon}\|x_i - w_k\|^2}}{\sum_{l=1}^{K} \delta_l\, e^{-\frac{1}{2\varepsilon}\|x_i - w_l\|^2}}$$
I identical update rules for W
I β can be seen as an inverse variance: it quantifies the uncertainty about the clustering results
So far...
I if $E(M,W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik}\, e_{ik}(W)$, one tries to reach $\min_{M\in\mathcal{M},\,W\in\mathbb{R}^p} E(M,W)$ using
$$F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \exp(-\beta e_{ik}(W))$$
I Fβ(W) is smooth
I a fixed point EM-like scheme can be used to minimize Fβ(W) for a fixed β
I but:
I the K-means algorithm is also an EM-like algorithm: it does not reach a global optimum
I how to handle β → ∞?
Outline
Introduction
Mixed problems
    Soft minimum
    Computing the soft minimum
    Evolution of β

Deterministic Annealing
    Annealing
    Maximum entropy
    Phase transitions
    Mass constrained deterministic annealing

Combinatorial problems
    Expectation approximations
    Mean field annealing
    In practice
Limit case β → 0
I limit behavior of the EM scheme:
I µik = 1/K for all i and k (and W!)
I W is therefore a solution of
$$\sum_{i=1}^{N} \sum_{k=1}^{K} \nabla_W e_{ik}(W) = 0$$
I no iteration needed
I vector quantization:
I we have
$$w_k = \frac{1}{N} \sum_{i=1}^{N} x_i$$
I each prototype is the center of mass of the data: only one cluster!
I in general the case β → 0 is easy (unique minimum under mild hypotheses)
Example

Clustering elements from R in 2 clusters (back to the earlier example)

[plots over the prototypes (w1, w2): the original problem, a simple soft minimum with β = 10^−1, and a more complex soft minimum with β = 0.5]
Increasing β
I when β increases, Fβ(W ) converges to minM∈M E(M,W )I optimizing Fβ(W ) becomes more and more complex:
I local minimaI rougher and rougher: Fβ(W ) remains C1 but with large
values for the gradient at some points
I path following strategy (homotopy):I for an increasing series (βl)l with β0 = 0I initialize W ∗
0 = arg minW F0(W ) for β = 0 by solving∑Ni=1∑K
k=1∇W eik (W ) = 0I compute W ∗
l via the EM scheme for βl starting from W ∗l−1
I a similar strategy is used in interior point algorithms
Example
Clustering in 2 classes of 6 elements from R:

x = 1.00, 2.00, 3.00, 7.00, 7.50, 8.25
Behavior of Fβ(W)

[sequence of plots of Fβ(W) over the prototypes (w1, w2), for β = 10^−2, 10^−1.75, 10^−1.5, 10^−1.25, 10^−1, 10^−0.75, 10^−0.5, 10^−0.25 and β = 1, preceded and followed by a plot of the limit minM∈M E(M,W): the landscape has a single smooth minimum for small β and becomes progressively rougher as β increases]
One step closer to the final algorithm
I given an increasing series (βl)l with β0 = 0:
1. compute $W^*_0$ such that
$$\sum_{i=1}^{N} \sum_{k=1}^{K} \nabla_W e_{ik}(W^*_0) = 0$$
2. for l = 1 to L:
2.1 initialize $W^*_l$ to $W^*_{l-1}$
2.2 compute
$$\mu_{ik} = \frac{\exp(-\beta_l\, e_{ik}(W^*_l))}{\sum_{m=1}^{K} \exp(-\beta_l\, e_{im}(W^*_l))}$$
2.3 update $W^*_l$ such that
$$\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik}\,\nabla_W e_{ik}(W^*_l) = 0$$
2.4 go back to 2.2 until convergence
3. use $W^*_L$ as an estimation of $\arg\min_{W\in\mathbb{R}^p} \bigl(\min_{M\in\mathcal{M}} E(M,W)\bigr)$
I this is (up to some technical details) the deterministic annealing algorithm for minimizing $E(M,W) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik}\, e_{ik}(W)$
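The complete loop can be sketched for the vector quantization case (an illustrative NumPy sketch; the small perturbation added at each temperature step, which lets coinciding prototypes split at phase transitions, is a standard practical twist not detailed on the slides):

```python
import numpy as np

def deterministic_annealing(X, K, betas, n_inner=100, tol=1e-8, seed=0):
    """DA for vector quantization: track the minimum of F_beta(W) along an
    increasing schedule (beta_l), starting from the beta -> 0 solution
    (all prototypes at the center of mass of the data)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    W = np.tile(X.mean(axis=0), (K, 1))  # beta -> 0 solution
    for beta in betas:
        # tiny perturbation so that identical prototypes can split
        W = W + 1e-4 * rng.standard_normal((K, p))
        for _ in range(n_inner):  # EM-like fixed point at fixed beta
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            mu = np.exp(logits)
            mu /= mu.sum(axis=1, keepdims=True)            # expectation phase
            W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]   # maximization phase
            done = np.abs(W_new - W).max() < tol
            W = W_new
            if done:
                break
    return W
```

On the six-point example of the slides, a schedule such as `np.logspace(-2, 1.5, 30)` recovers prototypes close to 2 (mean of 1, 2, 3) and 7.58 (mean of 7, 7.5, 8.25).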
What’s next?
I general topics:
I why the name annealing?
I why should that work? (and other philosophical issues)
I specific topics:
I technical “details” (e.g., annealing schedule)
I generalization:
I non additive criteria
I general combinatorial problems
Outline
Introduction
Mixed problems
    Soft minimum
    Computing the soft minimum
    Evolution of β

Deterministic Annealing
    Annealing
    Maximum entropy
    Phase transitions
    Mass constrained deterministic annealing

Combinatorial problems
    Expectation approximations
    Mean field annealing
    In practice
Annealing
I from Wikipedia:
Annealing is a heat treatment wherein a material is altered, causing changes in its properties such as strength and hardness. It is a process that produces conditions by heating to above the re-crystallization temperature and maintaining a suitable temperature, and then cooling.
I in our context, we’ll see that
I T = 1/(kB β) acts as a temperature and models some thermal agitation
I E(W, M) is the energy of a configuration of the system while Fβ(W) corresponds to the free energy of the system at a given temperature
I increasing β reduces the thermal agitation
Simulated Annealing
I a classical combinatorial optimization algorithm for computing minM∈M E(M)
I given an increasing series (βl)l
1. choose a random initial configuration M0
2. for l = 1 to L
2.1 take a small random step from Ml−1 to build $M^c_l$ (e.g., change the cluster of an object)
2.2 if $E(M^c_l) < E(M_{l-1})$ then set $M_l = M^c_l$
2.3 else set $M_l = M^c_l$ with probability $\exp(-\beta_l(E(M^c_l) - E(M_{l-1})))$
2.4 set $M_l = M_{l-1}$ in the other case
3. use ML as an estimation of arg minM∈M E(M)
I naive vision:
I always accept improvements
I “thermal agitation” allows the search to escape local minima
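The loop above is short to write down (an illustrative sketch; the concrete energy `E`, the proposal `propose` and the best-state bookkeeping are choices made for the example, not part of the slides’ pseudo-code):

```python
import numpy as np

def simulated_annealing(E, M0, propose, betas, seed=0):
    """Metropolis-style SA: always accept improvements, accept a worsening
    step with probability exp(-beta * dE); return the best state seen."""
    rng = np.random.default_rng(seed)
    M, best, E_M, E_best = M0, M0, E(M0), E(M0)
    for beta in betas:
        Mc = propose(M, rng)
        dE = E(Mc) - E_M
        if dE < 0 or rng.random() < np.exp(-beta * dE):
            M, E_M = Mc, E_M + dE
        if E_M < E_best:
            best, E_best = M, E_M
    return best

# clustering example: assignments of 6 points to 2 clusters, with energy =
# within-cluster sum of squares (prototypes are the cluster means)
X = np.array([1.0, 2.0, 3.0, 7.0, 7.5, 8.25])

def E(m):
    return sum(((X[m == k] - X[m == k].mean()) ** 2).sum()
               for k in (0, 1) if (m == k).any())

def propose(m, rng):
    mc = m.copy()
    mc[rng.integers(len(mc))] = rng.integers(2)  # reassign one random point
    return mc

betas = np.repeat(np.logspace(-2, 2, 40), 25)  # slow cooling, 1000 steps
best = simulated_annealing(E, np.zeros(6, dtype=int), propose, betas)
```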
![Page 94: An Introduction to Deterministic Annealingapiacoa.org › ... › deterministic-annealing › da-slides.pdf · 2020-03-30 · I in this presentation:deterministic annealing. Outline](https://reader035.vdocuments.us/reader035/viewer/2022062414/5f0d094a7e708231d4385c8b/html5/thumbnails/94.jpg)
Statistical physics
I consider a system with state space M
I denote E(M) the energy of the system when in state M
I at thermal equilibrium with the environment, the probability for the system to be in state M is given by the Boltzmann (Gibbs) distribution
$$P_T(M) = \frac{1}{Z_T} \exp\left(-\frac{E(M)}{k_B T}\right)$$
with T the temperature, kB Boltzmann’s constant, and
$$Z_T = \sum_{M\in\mathcal{M}} \exp\left(-\frac{E(M)}{k_B T}\right)$$
Back to annealing
I physical analogy:
I maintain the system at thermal equilibrium:
I set a temperature
I wait for the system to settle at this temperature
I slowly decrease the temperature:
I works well in real systems (e.g., crystallization)
I allows the system to explore the state space
I computer implementation:
I direct computation of PT(M) (or related quantities)
I sampling from PT(M)
Simulated Annealing Revisited

I simulated annealing samples from PT(M)!
I more precisely: the asymptotic distribution of Ml for a fixed β is given by $P_{1/(k_B\beta)}$
I how?
I Metropolis-Hastings Markov Chain Monte Carlo
I principle:
I P is the target distribution
I Q(·|·) is the proposal distribution (sampling friendly)
I start with $x^t$, get $x'$ from $Q(x|x^t)$
I set $x^{t+1}$ to $x'$ with probability
$$\min\left(1, \frac{P(x')\,Q(x^t|x')}{P(x^t)\,Q(x'|x^t)}\right)$$
I keep $x^{t+1} = x^t$ when this fails
Simulated Annealing Revisited

I in SA, Q is the random local perturbation
I major feature of Metropolis-Hastings MCMC:
$$\frac{P_T(M')}{P_T(M)} = \exp\left(-\frac{E(M') - E(M)}{k_B T}\right)$$
ZT is not needed
I underlying assumption: symmetric proposals
$$Q(x^t|x') = Q(x'|x^t)$$
I rationale:
I sampling directly from PT(M) for small T is difficult
I track likely areas during cooling
Mixed problems
I Simulated Annealing applies to combinatorial problems
I in mixed problems, this would mean removing the continuous part
$$\min_{M\in\mathcal{M},\,W\in\mathbb{R}^p} E(M,W) = \min_{M\in\mathcal{M}} \left( \min_{W\in\mathbb{R}^p} E(M,W) \right)$$
I in the vector quantization example
$$E(M) = \sum_{i=1}^{N} \sum_{k=1}^{K} M_{ik} \left\| x_i - \frac{1}{\sum_{j=1}^{N} M_{jk}} \sum_{j=1}^{N} M_{jk}\, x_j \right\|^2$$
I no obvious direct relation to the proposed algorithm
Statistical physics
I thermodynamic (free) energy: the useful energy available in a system (a.k.a. the energy that can be extracted to produce work)
I Helmholtz (free) energy is FT = U − TS, where U is the internal energy of the system and S its entropy
I one can show that
$$F_T = -k_B T \ln Z_T$$
I the natural evolution of a system is to reduce its free energy
I deterministic annealing mimics the cooling process of a system by tracking the evolution of the minimal free energy
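The identity can be checked directly from the Gibbs distribution, taking $U = \mathbb{E}_{P_T}(E(M))$ for the internal energy and the usual entropy $S = -k_B \sum_M P_T(M) \ln P_T(M)$:

```latex
% With P_T(M) = \exp(-E(M)/(k_B T)) / Z_T:
\begin{align*}
U &= \sum_{M\in\mathcal{M}} E(M)\, P_T(M), \\
S &= -k_B \sum_{M\in\mathcal{M}} P_T(M) \ln P_T(M)
   = -k_B \sum_{M\in\mathcal{M}} P_T(M)
     \Bigl(-\tfrac{E(M)}{k_B T} - \ln Z_T\Bigr)
   = \frac{U}{T} + k_B \ln Z_T, \\
F_T &= U - T S
    = U - T\Bigl(\frac{U}{T} + k_B \ln Z_T\Bigr)
    = -k_B T \ln Z_T.
\end{align*}
```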
Gibbs distribution
I still no use of PT ...
I important property:
I for f a function defined on M
$$\mathbb{E}_{P_T}(f(M)) = \frac{1}{Z_T} \sum_{M\in\mathcal{M}} f(M) \exp\left(-\frac{E(M)}{k_B T}\right)$$
I then
$$\lim_{T\to 0} \mathbb{E}_{P_T}(f(M)) = f(M^*),$$
where $M^* = \arg\min_{M\in\mathcal{M}} E(M)$
I useful to track global information
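This concentration is easy to observe numerically (an illustrative toy system, not from the slides):

```python
import numpy as np

# a 4-state system: as T -> 0 the Gibbs expectation of any f concentrates
# on the minimum-energy state
E = np.array([3.0, 1.0, 2.0, 0.5])      # energies, minimum at state 3
f = np.array([10.0, 20.0, 30.0, 40.0])  # an arbitrary observable
kB = 1.0                                # units where Boltzmann's constant is 1

for T in (10.0, 1.0, 0.1, 0.01):
    w = np.exp(-E / (kB * T))
    P_T = w / w.sum()             # Gibbs distribution at temperature T
    print(T, (P_T * f).sum())     # tends to f(M*) = 40 as T -> 0
```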
Membership functions
I in the EM-like phase, we have
$$\mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta e_{il}(W))}$$
I this does not come out of thin air:
$$\mu_{ik} = \mathbb{E}_{P_{\beta,W}}(M_{ik}) = \sum_{M\in\mathcal{M}} M_{ik}\, P_{\beta,W}(M)$$
I µik is therefore the probability for xi to belong to cluster k under the Gibbs distribution
I at the limit β → ∞, the µik peak to Dirac-like membership functions: they give the “optimal” partition
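For an additive energy this expectation can be computed in closed form, because the Gibbs distribution factorizes over the observations (each row of M contains exactly one 1):

```latex
% With E(M,W) = \sum_i \sum_k M_{ik} e_{ik}(W) and M an assignment matrix:
\begin{align*}
P_{\beta,W}(M)
  &= \frac{\exp\bigl(-\beta \sum_{i,k} M_{ik}\, e_{ik}(W)\bigr)}{Z_\beta}
   = \prod_{i=1}^{N} \frac{\exp\bigl(-\beta \sum_{k} M_{ik}\, e_{ik}(W)\bigr)}
                          {\sum_{l=1}^{K} \exp(-\beta e_{il}(W))}, \\
\mathbb{E}_{P_{\beta,W}}(M_{ik})
  &= P_{\beta,W}(M_{ik} = 1)
   = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta e_{il}(W))}
   = \mu_{ik}.
\end{align*}
```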
Example
Clustering in 2 classes of 6 elements from R:

x = 1.00, 2.00, 3.00, 7.00, 7.50, 8.25
Membership functions

[sequence of paired plots as β increases: the Fβ(W) landscape over (w1, w2) next to bar plots of the membership probabilities of the points x = 1, 2, 3, 7, 7.5, 8.25; the memberships are nearly uniform at high temperature and sharpen as β grows, with snapshots shown before and after a phase transition, after which the partition becomes crisp]
Summary
I two principles:
I physical systems tend to reach a state of minimal free energy at a given temperature
I when the temperature is slowly decreased, a system tends to reach a well organized state
I annealing algorithms tend to reproduce this slow cooling behavior
I simulated annealing:
I uses Metropolis-Hastings MCMC to sample the Gibbs distribution
I easy to implement but needs a very slow cooling
I deterministic annealing:
I direct minimization of the free energy
I computes expectations of global quantities with respect to the Gibbs distribution
I aggressive cooling is possible, but the ZT computation must be tractable
Outline

- Introduction
- Mixed problems
  - Soft minimum
  - Computing the soft minimum
  - Evolution of β
- Deterministic Annealing
  - Annealing
  - Maximum entropy
  - Phase transitions
  - Mass constrained deterministic annealing
- Combinatorial problems
  - Expectation approximations
  - Mean field annealing
  - In practice
Maximum entropy

- another interpretation/justification of DA
- reformulation of the problem: find a probability distribution on $\mathcal{M}$ which is "regular" and gives a low average error
- in other words: find $P_W$ such that
  - $\sum_{M\in\mathcal{M}} P_W(M) = 1$
  - $\sum_{M\in\mathcal{M}} E(W,M)\,P_W(M)$ is small
  - the entropy of $P_W$, $-\sum_{M\in\mathcal{M}} P_W(M)\ln P_W(M)$, is high
- entropy plays a regularization role
- this leads to the minimization of
  $$\sum_{M\in\mathcal{M}} E(W,M)\,P_W(M) + \frac{1}{\beta}\sum_{M\in\mathcal{M}} P_W(M)\ln P_W(M)$$
  where β sets the trade-off between fitting and regularity
Maximum entropy

- some calculations show that the minimum over $P_W$ is given by
  $$P_{\beta,W}(M) = \frac{1}{Z_{\beta,W}}\exp(-\beta E(M,W)), \quad\text{with}\quad Z_{\beta,W} = \sum_{M\in\mathcal{M}}\exp(-\beta E(M,W)).$$
- plugged back into the optimization criterion, we end up with the soft minimum
  $$F_\beta(W) = -\frac{1}{\beta}\ln\sum_{M\in\mathcal{M}}\exp(-\beta E(M,W))$$
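To make the limit behavior concrete, here is a minimal Python sketch (the energy values are hypothetical) showing that the Gibbs distribution concentrates on the minimizer and that the soft minimum $F_\beta$ approaches $\min_M E(M,W)$ from below as β grows:

```python
import math

# a small finite search space with hypothetical energies E(M)
energies = [3.0, 1.0, 4.0, 1.5, 2.0]

def gibbs(energies, beta):
    """Gibbs distribution P_beta(M) = exp(-beta * E(M)) / Z_beta."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)  # partition function Z_beta
    return [w / z for w in weights]

def soft_min(energies, beta):
    """F_beta = -(1/beta) * ln sum_M exp(-beta * E(M)); tends to min E as beta grows."""
    return -math.log(sum(math.exp(-beta * e) for e in energies)) / beta

for beta in (0.1, 1.0, 10.0, 100.0):
    print(beta, soft_min(energies, beta), max(gibbs(energies, beta)))
```

At high temperature (small β) the distribution is nearly uniform and $F_\beta$ is far below the minimum; at β = 100 the distribution puts almost all its mass on the minimizer and $F_\beta$ is within $10^{-3}$ of it.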
So far...

- three derivations of the deterministic annealing:
  - soft minimum
  - thermodynamic inspired annealing
  - maximum entropy principle
- all of them boil down to two principles:
  - replace the crisp optimization problem in the $\mathcal{M}$ space by a soft one
  - track the evolution of the solution when the crispness of the approximation increases
- now the technical details...
Fixed point

- remember the EM phase:
  $$\mu_{ik}(W) = \frac{\exp(-\beta\|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta\|x_i - w_l\|^2)}$$
  $$w_k = \frac{1}{\sum_{i=1}^N \mu_{ik}(W)}\sum_{i=1}^N \mu_{ik}(W)\,x_i$$
- this is a fixed point method, i.e., $W = U_\beta(W)$ for a data dependent $U_\beta$
- the fixed point is generally stable, except during a phase transition
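The fixed-point map $U_\beta$ above can be sketched in a few lines of numpy (the function name and the convergence handling are illustrative, not from the slides):

```python
import numpy as np

def em_step(X, W, beta):
    """One fixed-point iteration W <- U_beta(W) for squared-error clustering."""
    # squared distances ||x_i - w_k||^2, shape (N, K)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    # soft memberships mu_ik (softmax over clusters; row-wise shift for stability)
    mu = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
    mu /= mu.sum(axis=1, keepdims=True)
    # prototypes become mu-weighted means of the data
    return (mu.T @ X) / mu.sum(axis=0)[:, None], mu

# tiny illustration: two well separated groups on the x-axis
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
W = np.array([[1.0, 0.0], [4.0, 0.0]])
for _ in range(50):  # iterate the map until (numerical) convergence
    W, mu = em_step(X, W, beta=5.0)
```

At this β the iteration converges to the two group means, illustrating a stable fixed point.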
Phase transition

[Figure: membership probabilities for w1 and w2 before the phase transition]

[Figure: membership probabilities for w1 and w2 after the phase transition]
Stability

[Figures: positions of the prototypes w1 and w2 before and after the phase transition]
Stability

- unstable fixed point:
  - even during a phase transition, the exact previous fixed point can remain a fixed point
  - but we want the transition to take place: this is a new cluster birth!
  - fortunately, the previous fixed point is unstable during a phase transition
- modified EM phase:
  1. initialize $W^*_l$ to $W^*_{l-1} + \epsilon$
  2. compute $\mu_{ik} = \frac{\exp(-\beta e_{ik}(W^*_l))}{\sum_{l'=1}^K \exp(-\beta e_{il'}(W^*_l))}$
  3. update $W^*_l$ such that $\sum_{i=1}^N\sum_{k=1}^K \mu_{ik}\nabla_W e_{ik}(W^*_l) = 0$
  4. go back to 2 until convergence
- the noise is important to prevent missing a phase transition
Annealing strategy

- neglecting symmetries, there are K − 1 phase transitions for K clusters
- tracking W* between transitions is easy
- trade-off:
  - slow annealing: long running time, but no transition is missed
  - fast annealing: quicker, but with a higher risk of missing a transition
- but critical temperatures can be computed:
  - via a stability analysis of fixed points
  - corresponds to a linear approximation of the fixed point equation $U_\beta$ around a fixed point
Critical temperatures

- for vector quantization
- first temperature:
  - the fixed point stability is related to the variance of the data
  - the critical β is $1/(2\lambda_{\max})$ where $\lambda_{\max}$ is the largest eigenvalue of the covariance matrix of the data
- in general:
  - the stability of the whole system is related to the stability of each cluster
  - the $\mu_{ik}$ play the role of membership functions for each cluster
  - the critical β for cluster k is $1/(2\lambda^k_{\max})$ where $\lambda^k_{\max}$ is the largest eigenvalue of
    $$\sum_i \mu_{ik}\,(x_i - w_k)(x_i - w_k)^T$$
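A sketch of the critical temperature computation. It assumes the scatter matrix is normalized by $\sum_i \mu_{ik}$, so that with uniform memberships it reduces to the covariance matrix of the data; that normalization convention is an assumption, not stated on the slides:

```python
import numpy as np

def critical_beta(X, mu_k, w_k):
    """Critical beta 1/(2*lambda_max) for one cluster, lambda_max being the
    largest eigenvalue of the mu-weighted covariance of the cluster."""
    diffs = X - w_k
    cov = (mu_k[:, None] * diffs).T @ diffs / mu_k.sum()  # weighted covariance
    return 1.0 / (2.0 * np.linalg.eigvalsh(cov)[-1])      # eigvalsh sorts ascending

# with uniform memberships and w_k at the data mean, this gives the first
# transition, driven by the covariance matrix of the whole data set
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.5], [0.0, -0.5]])
beta_c = critical_beta(X, np.ones(len(X)), X.mean(axis=0))
```

Here the covariance is diagonal with largest eigenvalue 0.5, so the first critical β is 1.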
A possible final algorithm

1. initialize $W^1 = \frac{1}{N}\sum_{i=1}^N x_i$
2. for l = 2 to K:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 for t values of β around $\beta_l$:
       2.3.1 add some small noise to $W^l$
       2.3.2 compute $\mu_{ik} = \frac{\exp(-\beta\|x_i - W^l_k\|^2)}{\sum_{t'=1}^K \exp(-\beta\|x_i - W^l_{t'}\|^2)}$
       2.3.3 compute $W^l_k = \frac{1}{\sum_{i=1}^N \mu_{ik}}\sum_{i=1}^N \mu_{ik}\,x_i$
       2.3.4 go back to 2.3.2 until convergence
3. use $W^K$ as an estimation of $\arg\min_{W\in\mathbb{R}^p}(\min_{M\in\mathcal{M}} E(M,W))$ and $\mu_{ik}$ as the corresponding M
Computational cost

- N observations in $\mathbb{R}^p$ and K clusters
- prototype storage: Kp
- µ matrix storage: NK
- one EM iteration costs: O(NKp)
- we need at most K − 1 full EM runs: O(NK²p)
- compared to the K-means:
  - more storage
  - no simple distance calculation tricks: all distances must be computed
  - roughly corresponds to K − 1 k-means runs, not taking into account the number of distinct β considered during each phase transition
- neglecting the eigenvalue analysis (in O(p³))...
Collapsed prototypes

- when β → 0, only one cluster
- waste of computational resources: K identical calculations
- this is a general problem:
  - each phase transition adds new clusters
  - prior to that, prototypes are collapsed
  - but we need them to track cluster birth...
- a related problem: uniform cluster weights
- in a Gaussian mixture
  $$P(x_i \in C_k \mid x_i, W) = \frac{\delta_k e^{-\frac{1}{2\epsilon}\|x_i - w_k\|^2}}{\sum_{l=1}^K \delta_l e^{-\frac{1}{2\epsilon}\|x_i - w_l\|^2}}$$
  prototypes are weighted by the mixing coefficients δ
Mass constrained DA

- we introduce prototype weights $\delta_k$ and plug them into the soft minimum formulation
  $$F_\beta(W,\delta) = -\frac{1}{\beta}\sum_{i=1}^N \ln\sum_{k=1}^K \delta_k\exp(-\beta e_{ik}(W))$$
- $F_\beta$ is minimized under the constraint $\sum_{k=1}^K \delta_k = 1$
- this leads to
  $$\mu_{ik}(W) = \frac{\delta_k\exp(-\beta\|x_i - w_k\|^2)}{\sum_{l=1}^K \delta_l\exp(-\beta\|x_i - w_l\|^2)}$$
  $$\delta_k = \frac{1}{N}\sum_{i=1}^N \mu_{ik}(W)$$
  $$w_k = \frac{1}{N\delta_k}\sum_{i=1}^N \mu_{ik}(W)\,x_i$$
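The three coupled updates can be sketched as one fixed-point step in numpy (names are illustrative; the row-wise shift inside the exponential is a numerical stabilization that cancels in the normalization):

```python
import numpy as np

def mcda_step(X, W, delta, beta):
    """One mass-constrained fixed-point update (mu, delta, w)."""
    N = X.shape[0]
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    # weighted memberships mu_ik proportional to delta_k * exp(-beta * d2_ik)
    mu = delta[None, :] * np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
    mu /= mu.sum(axis=1, keepdims=True)
    delta_new = mu.sum(axis=0) / N                  # delta_k = (1/N) sum_i mu_ik
    W_new = (mu.T @ X) / (N * delta_new[:, None])   # w_k = sum_i mu_ik x_i / (N delta_k)
    return mu, delta_new, W_new

# two groups in 1D; iterating the step converges to their means, equal weights
X = np.array([[0.0], [0.1], [5.0], [5.1]])
W, delta = np.array([[1.0], [4.0]]), np.array([0.5, 0.5])
for _ in range(50):
    mu, delta, W = mcda_step(X, W, delta, beta=5.0)
```

Note that the weight constraint is preserved automatically: the updated δ always sums to 1.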
MCDA

- increases the similarity with EM for Gaussian mixtures:
  - exactly the same algorithm for a fixed β
  - isotropic Gaussian mixture with variance $\frac{1}{2\beta}$
- however:
  - the variance is generally a parameter in the Gaussian mixtures
  - different goals: minimal distortion versus maximum likelihood
  - no support for other distributions in DA
Cluster birth monitoring

- avoid wasting computational resources:
  - consider a cluster $C_k$ with a prototype $w_k$
  - prior to a phase transition in $C_k$:
    - duplicate $w_k$ into $w'_k$
    - apply some noise to both prototypes
    - split the weight $\delta_k$, giving $\delta_k/2$ to each of the two prototypes
  - apply the EM like algorithm and monitor $\|w_k - w'_k\|$
  - accept the new cluster if $\|w_k - w'_k\|$ becomes "large"
- two strategies:
  - eigenvalue based analysis (costly, but quite accurate)
  - opportunistic: always maintain duplicate prototypes and promote diverging ones
Final algorithm

1. initialize $W^1 = \frac{1}{N}\sum_{i=1}^N x_i$
2. for l = 2 to K:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 duplicate the prototype of the critical cluster and split the associated weight
   2.4 for t values of β around $\beta_l$:
       2.4.1 add some small noise to $W^l$
       2.4.2 compute $\mu_{ik} = \frac{\delta_k\exp(-\beta\|x_i - W^l_k\|^2)}{\sum_{t'=1}^K \delta_{t'}\exp(-\beta\|x_i - W^l_{t'}\|^2)}$
       2.4.3 compute $\delta_k = \frac{1}{N}\sum_{i=1}^N \mu_{ik}$
       2.4.4 compute $W^l_k = \frac{1}{N\delta_k}\sum_{i=1}^N \mu_{ik}\,x_i$
       2.4.5 go back to 2.4.2 until convergence
3. use $W^K$ as an estimation of $\arg\min_{W\in\mathbb{R}^p}(\min_{M\in\mathcal{M}} E(M,W))$ and $\mu_{ik}$ as the corresponding M
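A hedged end-to-end sketch of the algorithm above. For simplicity it duplicates the heaviest prototype at each new β until K prototypes exist, instead of computing the critical $\beta_l$ exactly, and it uses a fixed noise scale; both are hypothetical tuning choices, not the slides' prescription:

```python
import numpy as np

def mcda(X, K, betas, noise=1e-3, iters=100, seed=0):
    """Mass-constrained DA sketch: grow from 1 to K prototypes along an
    increasing beta schedule; duplication + noise lets phase transitions occur."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    W = X.mean(axis=0, keepdims=True)   # step 1: a single prototype at the mean
    delta = np.array([1.0])
    for beta in betas:
        if len(delta) < K:              # duplicate a prototype, split its weight
            j = int(np.argmax(delta))
            W = np.vstack([W, W[j] + noise * rng.standard_normal(X.shape[1])])
            delta = np.append(delta, delta[j] / 2)
            delta[j] /= 2
        for _ in range(iters):          # EM-like fixed point at this beta
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
            mu = delta[None, :] * np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
            mu /= mu.sum(axis=1, keepdims=True)
            delta = mu.sum(axis=0) / N
            W_new = (mu.T @ X) / (N * delta[:, None])
            if np.abs(W_new - W).max() < 1e-7:
                W = W_new
                break
            W = W_new
    return W, delta, mu

# two groups in 1D: the duplicate stays collapsed below the critical beta
# and separates into the two group means once beta passes it
W, delta, mu = mcda(np.array([[0.0], [0.1], [5.0], [5.1]]), K=2,
                    betas=[0.05, 0.5, 5.0])
```

A full implementation would duplicate only at the computed critical temperatures and merge duplicates that never diverge, as described on the previous slides.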
Example

[Figure: a simple two-dimensional dataset]

[Figure: evolution of the error E as a function of β (from 0.1 to 3.37), with cluster births marked]
Clusters

[Figures: cluster assignments and prototype positions on the dataset at successive stages of the annealing]
Summary

- deterministic annealing addresses combinatorial optimization by smoothing the cost function and tracking the evolution of the solution while the smoothing progressively vanishes
- motivated by:
  - heuristics (soft minimum)
  - statistical physics (Gibbs distribution)
  - information theory (maximal entropy)
- in practice:
  - a rather simple multiple-run EM like algorithm
  - a bit tricky to implement (phase transitions, noise injection, etc.)
  - excellent results (frequently better than those obtained by K-means)
Combinatorial only

- the situation is quite different for a combinatorial problem
  $$\arg\min_{M\in\mathcal{M}} E(M)$$
- no soft minimum heuristic
- no direct use of the Gibbs distribution $P_T$; one needs to rely on $E_{P_T}(f(M))$ for interesting f
- calculation of $P_T$ is not tractable:
  - tractability is related to independence
  - if $E(M) = E(M_1) + \ldots + E(M_D)$, then the sums $\sum_{M_1}\cdots\sum_{M_D}$ defining $Z_T$ factorize, and independent optimization can be done on each variable...
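A quick numerical check of the independence remark: for an additive energy, the partition function computed by brute force over all configurations equals a product of per-variable sums, and the global minimizer is found variable by variable (the energy tables below are hypothetical):

```python
import itertools
import math

# hypothetical additive energy over D = 3 binary variables: E(M) = sum_d E_d(M_d)
E_parts = [{0: 0.3, 1: 1.2}, {0: 0.7, 1: 0.1}, {0: 0.5, 1: 0.9}]
beta = 2.0

# brute force: sum exp(-beta * E(M)) over all 2^D configurations
Z_brute = sum(
    math.exp(-beta * sum(E_d[m] for E_d, m in zip(E_parts, M)))
    for M in itertools.product((0, 1), repeat=len(E_parts))
)

# factorized: Z = prod_d sum_{m_d} exp(-beta * E_d(m_d))
Z_fact = math.prod(
    sum(math.exp(-beta * e) for e in E_d.values()) for E_d in E_parts
)
```

The coupling terms of a real energy (such as the modularity below) are exactly what destroys this factorization.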
Graph clustering

- a non oriented graph with N nodes and weight matrix A ($A_{ij}$ is the weight of the connection between nodes i and j)
- maximal modularity clustering:
  - node degree $k_i = \sum_{j=1}^N A_{ij}$, total weight $m = \frac{1}{2}\sum_{i,j} A_{ij}$
  - null model $P_{ij} = \frac{k_i k_j}{2m}$, $B_{ij} = \frac{1}{2m}(A_{ij} - P_{ij})$
  - assignment matrix M
  - modularity of the clustering
    $$\mathrm{Mod}(M) = \sum_{i,j}\sum_k M_{ik}M_{jk}B_{ij}$$
- difficulty: the coupling $M_{ik}M_{jk}$ (non linearity)
- in the sequel, $E(M) = -\mathrm{Mod}(M)$
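The modularity formula can be checked directly in numpy (the helper below builds the assignment matrix M from a label vector; names are illustrative):

```python
import numpy as np

def modularity(A, labels):
    """Mod(M) = sum_{i,j} sum_k M_ik M_jk B_ij, with B = (A - k k^T / (2m)) / (2m)."""
    k = A.sum(axis=1)                          # node degrees k_i
    m = A.sum() / 2.0                          # total weight m
    B = (A - np.outer(k, k) / (2.0 * m)) / (2.0 * m)
    # assignment matrix: M_ik = 1 iff node i belongs to cluster k
    M = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
    return float(np.trace(M.T @ B @ M))

# two triangles linked by a single edge: the natural two-cluster split
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
mod = modularity(A, np.array([0, 0, 0, 1, 1, 1]))
```

For this graph the two-triangle partition gives modularity 5/14 ≈ 0.357, while putting everything in one cluster gives 0.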
Strategy for DA

- choose some interesting statistics:
  - $E_{P_T}(f(M))$
  - for instance $E_{P_T}(M_{ik})$ for clustering
- compute an approximation of $E_{P_T}(f(M))$ at high temperature T
- track the evolution of the approximation while lowering T
- use $\lim_{T\to 0} E_{P_T}(f(M)) = f(\arg\min_{M\in\mathcal{M}} E(M))$ to claim that the approximation converges to the interesting limit
- rationale: approximating $E_{P_T}(f(M))$ for a low T is probably difficult; we use the path following strategy to ease the task
Example

[Figure: a one-dimensional cost function on [−1, 1]]

[Figures: the Gibbs distribution over x for β ranging from 10^{−1} to 10^{2}: nearly flat at high temperature, then progressively concentrated on the global minimum]

[Figure: evolution of the expectation x̂ under the Gibbs distribution as a function of β]
![Page 177: An Introduction to Deterministic Annealingapiacoa.org › ... › deterministic-annealing › da-slides.pdf · 2020-03-30 · I in this presentation:deterministic annealing. Outline](https://reader035.vdocuments.us/reader035/viewer/2022062414/5f0d094a7e708231d4385c8b/html5/thumbnails/177.jpg)
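The concentration effect shown in these figures is easy to reproduce numerically: a Gibbs distribution over a discretized energy landscape puts almost all of its mass on the global minimizer as $\beta$ grows. A minimal sketch (the energy values below are made up for illustration):

```python
import math

def gibbs(energies, beta):
    # Boltzmann-Gibbs distribution: P_beta(x) proportional to exp(-beta * E(x))
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)  # partition function Z_beta
    return [w / z for w in weights]

# toy energy landscape: a local minimum at index 1, the global one at index 3
E = [0.5, 0.1, 0.4, 0.0, 0.3]
for beta in (0.1, 1.0, 10.0, 100.0):
    p = gibbs(E, beta)
    # as beta grows, the probability mass concentrates on the global minimum
    print(beta, p.index(max(p)), round(max(p), 3))
```

At $\beta \to 0$ the distribution is nearly uniform; at large $\beta$ it is essentially a point mass at the global minimum, exactly the behavior the plots display.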
Expectation approximation

- computing
  $$E_{P_Z}(f(Z)) = \int f(Z)\,dP_Z$$
  has many applications
- for instance, in machine learning:
  - performance evaluation (generalization error)
  - EM for probabilistic models
  - Bayesian approaches
- two major numerical approaches:
  - sampling
  - variational approximation
Sampling

- (strong) law of large numbers:
  $$\lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^N f(z_i) = \int f(Z)\,dP_Z$$
  if the $z_i$ are independent samples of $Z$
- many extensions, especially to dependent samples (slower convergence)
- MCMC methods: build a Markov chain with stationary distribution $P_Z$ and take the average of $f$ on a trajectory
- applied to $E_{P_T}(f(M))$: this is simulated annealing!
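As a minimal illustration of the law of large numbers above (a sketch, not tied to the annealing setting: the target distribution and $f$ are arbitrary choices), a Monte Carlo estimate of $E(f(Z))$ for $Z \sim N(0,1)$ and $f(z) = z^2$:

```python
import random

def mc_expectation(f, sampler, n):
    # empirical average of f over n independent samples
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)
# E(Z^2) = 1 for Z ~ N(0, 1); the estimate converges as n grows
estimate = mc_expectation(lambda z: z * z, lambda: rng.gauss(0.0, 1.0), 100_000)
```

The estimator's standard deviation shrinks as $1/\sqrt{n}$, which is why dependent samples (as in MCMC) only slow convergence rather than break it.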
Variational approximation

- main idea:
  - the intractability of the calculation of $E_{P_Z}(f(Z))$ is induced by the complexity of $P_Z$
  - let's replace $P_Z$ by a simpler distribution $Q_Z$...
  - ...and $E_{P_Z}(f(Z))$ by $E_{Q_Z}(f(Z))$
- calculus of variations:
  - optimization over functional spaces
  - in this context: choose an optimal $Q_Z$ in a space of probability measures $\mathcal{Q}$, with respect to a goodness of fit criterion between $P_Z$ and $Q_Z$
- probabilistic context:
  - simple distributions: factorized distributions
  - quality criterion: Kullback-Leibler divergence
Variational approximation

- assume $Z = (Z_1, \ldots, Z_D) \in \mathbb{R}^D$ and $f(Z) = \sum_{i=1}^D f_i(Z_i)$
- for a general $P_Z$, the structure of $f$ cannot be exploited in $E_{P_Z}(f(Z))$
- variational approximation:
  - choose $\mathcal{Q}$, a set of tractable distributions on $\mathbb{R}$
  - solve
    $$Q^* = \arg\min_{Q = Q_1 \times \cdots \times Q_D \in \mathcal{Q}^D} KL(Q\|P_Z)$$
    with
    $$KL(Q\|P_Z) = -\int \ln \frac{dP_Z}{dQ}\,dQ$$
  - approximate $E_{P_Z}(f(Z))$ by
    $$E_{Q^*}(f(Z)) = \sum_{i=1}^D E_{Q^*_i}(f_i(Z_i))$$
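A tiny worked instance of this factorized approximation (the joint distribution $P$ below is invented for illustration): a brute-force search over product distributions $Q_1 \times Q_2$ on two binary variables for the one minimizing $KL(Q\|P)$.

```python
import math

# target joint distribution over two correlated binary variables
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def kl_factorized(a, b):
    # KL(Q || P) for the product distribution Q1 x Q2
    # with Q1(1) = a and Q2(1) = b
    kl = 0.0
    for (z1, z2), p in P.items():
        q = (a if z1 else 1.0 - a) * (b if z2 else 1.0 - b)
        kl += q * math.log(q / p)
    return kl

# crude grid search over the two free parameters
grid = [i / 100 for i in range(1, 100)]
a_star, b_star = min(((a, b) for a in grid for b in grid),
                     key=lambda ab: kl_factorized(*ab))
```

With this symmetric $P$ the search lands on the uniform product $a = b = 0.5$: the factorized family cannot represent the correlation between the two variables, which is exactly the price paid for tractability.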
Application to DA

- general form: $P_\beta(M) = \frac{1}{Z_\beta}\exp(-\beta E(M))$
- intractable because $E(M)$ is not linear in $M = (M_1, \ldots, M_D)$
- variational approximation:
  - replace $E$ by $F(W, M)$ defined by
    $$F(W, M) = \sum_{i=1}^D W_i M_i$$
  - use for $Q_{W,\beta}$
    $$Q_{W,\beta} = \frac{1}{Z_{W,\beta}}\exp(-\beta F(W, M))$$
  - optimize $W$ via $KL(Q_{W,\beta}\|P_\beta)$
Fitting the approximation

- a complex expectation calculation is replaced by a new optimization problem
- we have
  $$KL(Q_{W,\beta}\|P_\beta) = \ln Z_\beta - \ln Z_{W,\beta} + \beta E_{Q_{W,\beta}}(E(M) - F(W, M))$$
- tractable by factorization on $F$, except for $Z_\beta$
- solved by $\nabla_W KL(Q_{W,\beta}\|P_\beta) = 0$
  - tractable because $Z_\beta$ does not depend on $W$
  - generally solved via a fixed point approach (EM-like again)
- similar to variational approximation in EM or Bayesian methods
Example

- in the case of (graph) clustering, we use
  $$F(W, M) = \sum_{i=1}^N \sum_{k=1}^K W_{ik} M_{ik}$$
- a crucial (general) property is that, when $i \neq j$,
  $$E_{Q_{W,\beta}}(M_{ik} M_{jl}) = E_{Q_{W,\beta}}(M_{ik})\, E_{Q_{W,\beta}}(M_{jl})$$
- a (long and boring) series of equations leads to the general mean field equations
  $$\frac{\partial E_{Q_{W,\beta}}(E(M))}{\partial W_{jl}} = \sum_{k=1}^K \frac{\partial E_{Q_{W,\beta}}(M_{jk})}{\partial W_{jl}} W_{jk}, \quad \forall j, l.$$
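The crucial factorization property can be checked directly: under a distribution $Q$ that factorizes over the nodes, $E_Q(M_{ik} M_{jl})$ is the product of the marginal expectations. A small enumeration (the membership probabilities below are arbitrary):

```python
from itertools import product

# factorized distribution over the cluster assignments of two nodes,
# Q(M) = Q_1(c_1) * Q_2(c_2); M_ik = 1 iff node i is in cluster k
q1 = [0.7, 0.3]  # cluster probabilities for node 1
q2 = [0.2, 0.8]  # cluster probabilities for node 2

def joint_expectation(k, l):
    # E_Q(M_1k * M_2l) computed by brute-force enumeration
    return sum(q1[c1] * q2[c2]
               for c1, c2 in product(range(2), range(2))
               if c1 == k and c2 == l)

# equals the product of the marginals: E_Q(M_1k) * E_Q(M_2l) = q1[k] * q2[l]
```

This independence is what lets the expectation of the quadratic energy split into products of node-wise expectations in the mean field equations.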
Example

- classical EM-like scheme to solve the mean field equations:
  1. we have
     $$E_{Q_{W,\beta}}(M_{ik}) = \frac{\exp(-\beta W_{ik})}{\sum_{l=1}^K \exp(-\beta W_{il})}$$
  2. keep $E_{Q_{W,\beta}}(M_{ik})$ fixed and solve the simplified mean field equations for $W$
  3. update $E_{Q_{W,\beta}}(M_{ik})$ based on the new $W$ and loop on 2 until convergence
- in the graph clustering case
  $$W_{jk} = 2\sum_{i \neq j} E_{Q_{W,\beta}}(M_{ik}) B_{ij}$$
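Step 1 of the scheme is simply a softmax over clusters; a direct transcription (the max-shift is a numerical-stability detail added here, not something stated on the slide):

```python
import math

def memberships(W, beta):
    # E_{Q_{W,beta}}(M_ik) = exp(-beta * W_ik) / sum_l exp(-beta * W_il)
    mu = []
    for row in W:
        m = min(row)  # shift so the largest exponent is 0 (avoids overflow)
        expw = [math.exp(-beta * (w - m)) for w in row]
        z = sum(expw)
        mu.append([e / z for e in expw])
    return mu
```

At $\beta \to 0$ each row tends to the uniform $1/K$; at large $\beta$ it concentrates on the cluster with the smallest $W_{ik}$.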
Summary

- deterministic annealing for combinatorial optimization:
  - given an objective function $E(M)$
  - choose a linear parametric approximation $F(W, M)$ of $E(M)$, with the associated distribution $Q_{W,\beta} = \frac{1}{Z_{W,\beta}}\exp(-\beta F(W, M))$
  - write the mean field equations, i.e.
    $$\nabla_W\left(-\ln Z_{W,\beta} + \beta E_{Q_{W,\beta}}(E(M) - F(W, M))\right) = 0$$
  - use an EM-like algorithm to solve the equations:
    - given $W$, compute $E_{Q_{W,\beta}}(M_{ik})$
    - given $E_{Q_{W,\beta}}(M_{ik})$, solve the equations
- back to our classical questions:
  - why would that work?
  - how to do that in practice?
Outline

- Introduction
- Mixed problems
  - Soft minimum
  - Computing the soft minimum
  - Evolution of β
- Deterministic Annealing
  - Annealing
  - Maximum entropy
  - Phase transitions
  - Mass constrained deterministic annealing
- Combinatorial problems
  - Expectation approximations
  - Mean field annealing
  - In practice
Physical interpretation

- back to the (generalized) Helmholtz (free) energy
  $$F_\beta = U - \frac{S}{\beta}$$
- the free energy can be generalized to any distribution $Q$ on the states by
  $$F_{\beta,Q} = E_Q(E(M)) - \frac{H(Q)}{\beta},$$
  where $H(Q) = -\sum_{M \in \mathcal{M}} Q(M) \ln Q(M)$ is the entropy of $Q$
- the Boltzmann-Gibbs distribution minimizes the free energy, i.e.
  $$F_T \leq F_{T,Q}$$
Physical interpretation

- in fact we have
  $$F_{T,Q} = F_T + \frac{1}{\beta} KL(Q\|P),$$
  where $P$ is the Gibbs distribution
- minimizing $KL(Q\|P) \geq 0$ over $\mathcal{Q}$ corresponds to finding the best upper bound of the free energy on a class of distributions
- in other words: given that the system states must be distributed according to a distribution in $\mathcal{Q}$, find the most stable distribution
Mean field

- $E(M)$ is difficult to handle because of coupling (dependencies), while $F(W, M)$ is linear and corresponds to a de-coupling in which dependencies are replaced by mean effects
- example:
  - modularity:
    $$E(M) = \sum_i \sum_k M_{ik} \sum_j M_{jk} B_{ij}$$
  - approximation:
    $$F(W, M) = \sum_i \sum_k M_{ik} W_{ik}$$
- the complex influence of $M_{ik}$ on $E$ is replaced by a single parameter $W_{ik}$: the mean effect associated with a change in $M_{ik}$
Outline

- Introduction
- Mixed problems
  - Soft minimum
  - Computing the soft minimum
  - Evolution of β
- Deterministic Annealing
  - Annealing
  - Maximum entropy
  - Phase transitions
  - Mass constrained deterministic annealing
- Combinatorial problems
  - Expectation approximations
  - Mean field annealing
  - In practice
In practice

- homotopy again:
  - at high temperature ($\beta \to 0$):
    - $E_{Q_{W,\beta}}(M_{ik}) = \frac{1}{K}$
    - the fixed point equation is generally easy to solve; for instance, in graph clustering
      $$W_{jk} = \frac{2}{K}\sum_{i \neq j} B_{ij}$$
  - slowly decrease the temperature
  - use the mean field at the previous, higher temperature as a starting point for the fixed point iterations
- phase transitions:
  - stable versus unstable fixed points (eigenanalysis)
  - noise injection and mass constrained version
A graph clustering algorithm

1. initialize $W^1_{jk} = \frac{2}{K}\sum_{i \neq j} B_{ij}$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 for $t$ values of $\beta$ around $\beta_l$:
       2.3.1 add some small noise to $W^l$
       2.3.2 compute $\mu_{ik} = \frac{\exp(-\beta W_{ik})}{\sum_{m=1}^K \exp(-\beta W_{im})}$
       2.3.3 compute $W_{jk} = 2\sum_{i \neq j} \mu_{ik} B_{ij}$
       2.3.4 go back to 2.3.2 until convergence
3. threshold $\mu_{ik}$ into an optimal partition of the original graph

- mass constrained approach variant
- stability based transition detection (slower annealing)
Karate

- clustering of a simple graph
- Zachary's Karate club
- four clusters

[Figure: the Karate club network, 34 numbered nodes]
Modularity

Evolution of the modularity during annealing

[Figure: modularity as a function of $\beta$ (log scale, roughly 50 to 500), ranging from 0.0 to about 0.4]
Phase transitions

- first phase: 2 clusters
- second phase: 4 clusters

[Figure: four panels (clusters 1–4)]
Karate

- clustering of a simple graph
- Zachary's Karate club
- four clusters

[Figure: the four clusters displayed on the Karate club network]
DA Roadmap

How to apply deterministic annealing to a combinatorial optimization problem $E(M)$?

1. define a linear mean field approximation $F(W, M)$
2. specialize the mean field equations, i.e., compute
   $$\frac{\partial E_{Q_{W,\beta}}(E(M))}{\partial W_{jl}} = \frac{\partial}{\partial W_{jl}}\left(\frac{1}{Z_{W,\beta}}\sum_M E(M)\exp(-\beta F(W, M))\right)$$
3. identify a fixed point scheme to solve the mean field equations, using $E_{Q_{W,\beta}}(M_i)$ as a constant if needed
4. wrap the corresponding EM scheme in an annealing loop: this is the basic algorithm
5. analyze the fixed point scheme to find stability conditions: this leads to an advanced annealing schedule
Summary

- deterministic annealing addresses combinatorial optimization by solving a related but simpler problem and by tracking the evolution of the solution while the simpler problem converges to the original one
- motivated by:
  - a heuristic (the soft minimum)
  - statistical physics (the Gibbs distribution)
  - information theory (maximum entropy)
- works mainly because of the solution-following strategy (homotopy): do not solve a difficult problem from scratch, but rather start from a good guess of the solution
Summary

- difficulties:
  - boring and technical calculations (when facing a new problem)
  - a bit tricky to implement (phase transitions, noise injection, etc.)
  - tends to be slow in practice: computing exp is costly even on modern hardware (roughly 100 flops)
- advantages:
  - versatile
  - appears as a simple EM-like algorithm embedded in an annealing loop
  - very interesting intermediate results
  - excellent final results in practice
References

- T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14, January 1997.
- K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, November 1998.
- F. Rossi and N. Villa. Optimizing an organized modularity measure for topographic graph clustering: a deterministic annealing approach. Neurocomputing, 73(7–9):1142–1163, March 2010.