Random Swap algorithm - Itä-Suomen yliopisto (University of Eastern Finland), cs.uef.fi/pages/franti/cluster/randomswap.pdf · 2018
TRANSCRIPT
Random Swap algorithm
Pasi Fränti, 24.4.2018
Definitions and data
Set of N data points: X = {x1, x2, …, xN}
Set of k cluster prototypes (centroids): C = {c1, c2, …, ck}
Partition of the data: P = {p1, p2, …, pN}

Clustering problem

Objective function: MSE = (1/N) · Σ_{i=1..N} d(xi, c_{pi})²
Optimality of partition: pi = arg min_{1≤j≤k} d(xi, cj)², ∀ i ∈ [1, N]
Optimality of centroid: cj = Σ_{pi=j} xi / Σ_{pi=j} 1, ∀ j ∈ [1, k]
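These definitions map directly to code. A minimal NumPy sketch (the helper names are mine, not from the slides; every cluster is assumed non-empty, and MSE is the mean squared distance to the assigned centroid, as defined above):

```python
import numpy as np

def mse(X, C, P):
    # objective: MSE = (1/N) * sum_i d(x_i, c_{p_i})^2
    return np.mean(np.sum((X - C[P]) ** 2, axis=1))

def optimal_partition(X, C):
    # p_i = argmin_j d(x_i, c_j)^2 : nearest centroid for each point
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def optimal_centroids(X, P, k):
    # c_j = mean of the points assigned to cluster j
    return np.array([X[P == j].mean(axis=0) for j in range(k)])
```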
K-means algorithm
X = data set, C = cluster centroids, P = partition

K-Means(X, C) → (C, P)
REPEAT
  Cprev ← C;
  FOR i = 1 TO N DO                // optimal partition
    pi ← FindNearest(xi, C);
  FOR j = 1 TO k DO                // optimal centroids
    cj ← average of all xi with pi = j;
UNTIL C = Cprev;
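The pseudocode above can be turned into a short runnable sketch (a plain NumPy implementation written for illustration; an empty cluster simply keeps its previous centroid):

```python
import numpy as np

def kmeans(X, C):
    """Alternate optimal partition and optimal centroids
    until the centroids no longer change."""
    C = C.astype(float).copy()
    while True:
        C_prev = C.copy()
        # optimal partition: assign each point to its nearest centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        P = d2.argmin(axis=1)
        # optimal centroids: mean of each cluster
        for j in range(len(C)):
            members = X[P == j]
            if len(members):
                C[j] = members.mean(axis=0)
        if np.allclose(C, C_prev):
            return C, P
```

Starting from a poor initialization this loop converges to a local minimum only; that limitation is exactly what the swap strategy addresses.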
Problems of k-means
Swapping strategy
Pigeon hole principle

CI = Centroid index:

P. Fränti, M. Rezaei and Q. Zhao, "Centroid index: cluster level similarity measure", Pattern Recognition, 47 (9), 3034-3045, September 2014.

Aim of the swap
Two centroids, but only one cluster. One centroid, but two clusters.
[Illustration on data set S2.]
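The centroid index compares two solutions at the cluster level: map every centroid of one solution to its nearest counterpart in the other and count the "orphan" centroids that receive no mapping; CI is the maximum over both directions. A small sketch (helper names are mine; Euclidean distance assumed):

```python
import numpy as np

def orphans(C1, C2):
    # map every centroid of C1 to its nearest centroid in C2,
    # then count the centroids of C2 that received no mapping
    d2 = ((C1[:, None, :] - C2[None, :, :]) ** 2).sum(axis=2)
    mapped = set(d2.argmin(axis=1).tolist())
    return len(C2) - len(mapped)

def centroid_index(C1, C2):
    # CI = 0 means the two solutions share the same cluster structure
    return max(orphans(C1, C2), orphans(C2, C1))
```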
Random Swap algorithm

RandomSwap(X) → (C, P)
  C ← SelectRandomRepresentatives(X);
  P ← OptimalPartition(X, C);
  REPEAT T times
    (Cnew, j) ← RandomSwap(X, C);
    Pnew ← LocalRepartition(X, Cnew, P, j);
    (Cnew, Pnew) ← KMeans(X, Cnew, Pnew);
    IF f(Cnew, Pnew) < f(C, P) THEN
      (C, P) ← (Cnew, Pnew);
  RETURN (C, P);
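A compact, self-contained Python version of the pseudocode (my own sketch: it uses a full repartition instead of the O(N/k) local repartition, and a couple of k-means iterations per trial swap):

```python
import random
import numpy as np

def partition(X, C):
    # nearest-centroid assignment for every point
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def update_centroids(X, P, C):
    C = C.copy()
    for j in range(len(C)):
        members = X[P == j]
        if len(members):
            C[j] = members.mean(axis=0)
    return C

def mse(X, C, P):
    return np.mean(np.sum((X - C[P]) ** 2, axis=1))

def random_swap(X, k, T=200, kmeans_iters=2, seed=0):
    rng = random.Random(seed)
    C = X[rng.sample(range(len(X)), k)].astype(float)
    P = partition(X, C)
    best = mse(X, C, P)
    for _ in range(T):
        Cn = C.copy()
        # random swap: replace a random centroid by a random data point
        Cn[rng.randrange(k)] = X[rng.randrange(len(X))]
        Pn = partition(X, Cn)
        for _ in range(kmeans_iters):   # brief k-means refinement
            Cn = update_centroids(X, Pn, Cn)
            Pn = partition(X, Cn)
        # accept the swap only if the objective improves
        f = mse(X, Cn, Pn)
        if f < best:
            C, P, best = Cn, Pn, f
    return C, P
```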
Steps of the swap

1. Random swap: cj ← xi, j = random(1, k), i = random(1, N)
2. Re-allocate vectors from old cluster: pi ← arg min_{1≤j'≤k} d(xi, cj')², ∀ i : pi = j
3. Create new cluster: pi ← arg min_{j'∈{pi, j}} d(xi, cj')², ∀ i ∈ [1, N]

Swap is made from a centroid-rich area to a centroid-poor area.
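The two re-allocation steps can be written down directly (a sketch; C is assumed to already contain the newly swapped centroid at index j):

```python
import numpy as np

def local_repartition(X, C, P, j):
    P = P.copy()
    # step 2: re-allocate the vectors of the removed cluster j
    members = np.where(P == j)[0]
    if len(members):
        d2 = ((X[members, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        P[members] = d2.argmin(axis=1)
    # step 3: create the new cluster - any vector closer to the
    # new centroid c_j than to its current centroid defects to j
    d_new = np.sum((X - C[j]) ** 2, axis=1)
    d_cur = np.sum((X - C[P]) ** 2, axis=1)
    P[d_new < d_cur] = j
    return P
```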
Swap
Local re-partition: re-allocate vectors, create new cluster
Iterate by k-means: 1st, 2nd, 3rd, …, 19th iteration
Final result: 25 iterations
Extreme example (Bridge)
[Bar chart, final MSE: K-means 176.53; Random + RS 163.93; K-means + RS 163.63; Split + RS 163.51; Ward + RS 163.08.]
Dependency on initial solution
Data sets
Data sets visualized: Images

Data set       Dim    Source
Bridge         16-d   4x4 blocks (gray image)
House          3-d    RGB color
Miss America   16-d   4x4 blocks, frame differential
Europe         2-d    differential coordinates
Data sets visualized: Artificial
S1, S2, S3, S4; Unbalance; DIM32; G2-2-30, G2-2-50, G2-2-70; A1, A2, A3
Time complexity

Efficiency of the random swap
Total time to find correct clustering = time per iteration × number of iterations

Time complexity of a single iteration:
• Swap: O(1)
• Remove cluster: 2kN/k = O(N)
• Add cluster: 2N = O(N)
• Centroids: 2N/k + 2N/k + 2 = O(N/k)
• K-means: IkN = O(IkN)  (bottleneck!)
Efficiency of the random swap (with fast k-means)

Same breakdown as above, except the k-means step:
• (Fast) K-means: 4N = O(N), 2 iterations only!

T. Kaukoranta, P. Fränti and O. Nevalainen, "A fast exact GLA based on code vector activity detection", IEEE Trans. on Image Processing, 9 (8), 1337-1342, August 2000.
Estimated and observed steps

Processing time profile (Bridge)
N = 4096, k = 256, N/k = 16, α = 8
[Chart: time (ms) per component; 1st k-means iteration ≈ 53 %, 2nd ≈ 39 %, local repartition the remainder.]
Processing time profile (Birch1)
N = 100,000, k = 100, α = 5.5
[Chart: time (ms) per component until CI=0 is reached; 1st k-means iteration ≈ 49-50 %, 2nd ≈ 44-48 %, local repartition the remainder.]
Effect of K-means iterations
[Plot (Bridge): MSE (160-190) vs. time (0.01-100 s, log scale), one curve per number of k-means iterations, 1-5.]
[Plot (Birch2): MSE (0-12) vs. time (0.1-100 s, log scale), curves for 1-5 iterations.]
How many swaps?

Three types of swaps:
• Trial swap
• Accepted swap: MSE improves
• Successful swap: CI improves

Example: before swap MSE = 20.12 (CI=2); accepted swap MSE = 20.09 (CI=2); successful swap MSE = 15.87 (CI=1).

Accepted and successful swaps
[Figure: examples with CI=4 and CI=9.]
Number of swaps needed: example with 35 clusters (A3)
K-means clustering result (3 swaps needed)
[Figure: centroids marked for removal and locations where centroids are added.]
Final clustering result
Number of swaps needed: example from image quantization
[Histogram of swaps needed (0-100): Max = 322, Average = 35, Median = 26, 90 % of cases < 70.]
Statistical observations
S1: N = 5000, k = 15, d = 2, N/k = 333, α = 4.1
S4: N = 5000, k = 15, d = 2, N/k = 333, α = 4.8
[Histogram of swaps needed for S4 (0-100): Max = 248, Average = 25, Median = 18, 90 % of cases < 50.]
Theoretical estimation

Probability of good swap
• Select a proper prototype to remove: there are k clusters in total: p_removal = 1/k
• Select a proper new location: there are N choices (p_add = 1/N), but only k are significantly different: p_add = 1/k
• Both happen at the same time: k² significantly different swaps; probability of each different swap is p_swap = 1/k²
• Open question: how many of these are good? p = (α/k)·(α/k) = O(α/k)²
• Probability of not finding a good swap in T iterations: q = (1 − α²/k²)^T

Expected number of iterations
• Estimated number of iterations: T = log q / log(1 − α²/k²)
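The estimate is easy to evaluate numerically (a sketch; p = (α/k)² per the assumption above, function name is mine):

```python
import math

def expected_iterations(q, k, alpha):
    """Number of trial swaps T so that the probability of never
    finding a good swap falls to q, with p = (alpha/k)^2 per trial."""
    p = (alpha / k) ** 2
    return math.log(q) / math.log(1.0 - p)
```

For S-type data (k = 15, α ≈ 4), a 1 % failure level needs only a few dozen trial swaps.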
Probability of failure (q) depending on T
[Plot: q on a log scale (10⁻⁹ … 1) vs. iterations T (0-300).]
Observed probability (%) of fail (S1-S4)
N = 5000, k = 15, d = 2, N/k = 333, α = 4.5
[Chart: observed failure probability versus predicted levels q = 0.1 %, 1 %, 10 % on data sets S1-S4.]
Observed probability (%) of fail (DIM 32-128)
N = 1024, k = 16, d = 32-128, N/k = 64, α = 1.1
[Chart: observed failure probability versus predicted levels q = 0.1 %, 1 %, 10 % for dimensions 32-1024.]
Bounds for the iterations

Upper limit:
T = ln q / ln(1 − α²/k²) ≤ −ln q · k²/α² = ln(1/q) · k²/α²

Lower limit similarly; resulting in:
T = Θ((k²/α²) · ln(1/q))
Multiple swaps (w)

Probability of performing fewer than w swaps in T iterations (binomial sum):

q = Σ_{i=0}^{w−1} C(T, i) · (α²/k²)^i · (1 − α²/k²)^{T−i}

Expected number of iterations: while i of the w swaps are still missing, the chance of finding one is about i·α²/k², so

T̂ = Σ_{i=1}^{w} (1/i)·(k²/α²) ≈ (k²/α²) · log w
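The binomial tail above can be checked numerically (a sketch with the same p = (α/k)² assumption; function name is mine):

```python
import math

def prob_fewer_than_w(T, w, k, alpha):
    """Probability that fewer than w good swaps occur in
    T independent trials with success probability (alpha/k)^2."""
    p = (alpha / k) ** 2
    return sum(math.comb(T, i) * p ** i * (1 - p) ** (T - i)
               for i in range(w))
```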
Expected time complexity

1. Linear dependency on N
2. Quadratic dependency on k (with a large number of clusters, it can be too slow)
3. Logarithmic dependency on w (close to constant)
4. Inverse dependency on α (the higher the dimensionality, the faster the method)

t̂(N, k) = O((k²/α²) · N · log w)
Linear dependency on N
N < 100,000, k = 100, d = 2, N/k = 1000, α = 3.1
[Plot (Birch2): iterations (400-1800) vs. data size (0-100 thousand); 25 % and 75 % quantile curves.]
Quadratic dependency on k
N < 100,000, k = 100, d = 2, N/k = 1000, α = 3.1
[Plot (Birch2): iterations (0-1400) vs. number of clusters (0-100); 25 % and 75 % quantile curves with quadratic fit.]
Logarithmic dependency on w
Theory vs. reality
Neighborhood size

How much is α?

Voronoi neighbors vs. neighbors by distance
[Figure: a prototype with its Voronoi neighbors and its nearest prototypes by distance, numbered.]

2-dim: 2(3k − 6)/k = 6 − 12/k
D-dim: 2k^(D/2)/k = O(2k^(D/2−1))
Upper limit: α ≤ k
Observed number of neighbors (data set S2)
[Histogram: frequency (0-45 %) vs. number of neighbours (1-9); average = 3.9.]
Estimate of α
• Five iterations of random swap clustering
• For each pair of prototypes A and B:
  1. Calculate the half point HP = (A+B)/2
  2. Find the nearest prototype C for HP
  3. If C = A or C = B, they are potential neighbors.
• Analyze potential neighbors:
  1. Calculate all vector distances across A and B
  2. Select the nearest pair (a, b)
  3. If d(a, b) < min(d(a, C(a)), d(b, C(b))) then accept
• α = number of pairs found / k
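The pair test above in code (my own sketch; squared distances are used throughout, which preserves all the comparisons):

```python
import numpy as np

def estimate_alpha(X, C, P):
    k = len(C)
    pairs = 0
    for a in range(k):
        for b in range(a + 1, k):
            hp = (C[a] + C[b]) / 2                      # half point
            nearest = np.sum((C - hp) ** 2, axis=1).argmin()
            if nearest not in (a, b):
                continue                                # not potential neighbors
            Xa, Xb = X[P == a], X[P == b]
            if not len(Xa) or not len(Xb):
                continue
            # nearest cross pair (x_a, x_b) between the two clusters
            d = np.sum((Xa[:, None, :] - Xb[None, :, :]) ** 2, axis=2)
            ia, ib = np.unravel_index(d.argmin(), d.shape)
            if d[ia, ib] < min(np.sum((Xa[ia] - C[a]) ** 2),
                               np.sum((Xb[ib] - C[b]) ** 2)):
                pairs += 1                              # accept the pair
    return pairs / k
```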
Observed values of α
Optimality

Multiple optima (= plateaus)
• Very similar results (< 0.3 % difference in MSE)
• CI-values significantly high (9 %)
• Finds one of the near-optimal solutions
Experiments

Time-versus-distortion (Bridge)
N = 4096, k = 256, d = 16, N/k = 16, α = 5.4
[Plot: MSE (160-190) vs. time (s, log scale): single k-means, repeated k-means (10,000 repeats), and Random Swap with swap counts annotated along the curve.]
Time-versus-distortion (Miss America)
N = 6480, k = 256, d = 16, N/k = 25, α = 17.1
[Plot: MSE (5.00-6.50) vs. time (log scale): single k-means, repeated k-means (10,000 repeats), and Random Swap with swap counts annotated.]
Time-versus-distortion (Birch1)
N = 100,000, k = 100, d = 2, N/k = 1000, α = 5.8
[Plot: MSE (4.50-7.00) vs. time (log scale): single k-means, repeated k-means (2500 repeats), and Random Swap.]
Time-versus-distortion (Birch2)
N = 100,000, k = 100, d = 2, N/k = 1000, α = 3.1
[Plot: MSE (0-10) vs. time (log scale): single k-means, repeated k-means (2500 repeats), and Random Swap.]
Time-versus-distortion (Europe)
N = 169,673, k = 256, d = 2, N/k = 663, α = 6.3
[Plot: MSE (0-16) vs. time (log scale): single k-means, repeated k-means (500 repeats), and Random Swap.]
Time-versus-distortion (Unbalance)
N = 6500, k = 8, d = 2, N/k = 821, α = 2.3
[Plot: MSE (0-140) vs. time (log scale): single k-means, repeated k-means, and Random Swap.]
Variation of results (Bridge)
[Histogram of final MSE (160-185): random swap centered at 164.69, repeated k-means at 179.82.]

Variation of results (Birch1)
[Histogram of final MSE (4.0-6.5): random swap at 4.64, repeated k-means at 5.48.]
Comparison of algorithms
• k-means (KM)
• k-means++
• repeated k-means (RKM)
• x-means
• agglomerative clustering (AC)
• random swap (RS)
• global k-means
• genetic algorithm
D. Arthur and S. Vassilvitskii, "K-means++: the advantages of careful seeding", ACM-SIAM Symp. on Discrete Algorithms (SODA'07), New Orleans, LA, 1027-1035, January 2007.
D. Pelleg and A. Moore, "X-means: extending k-means with efficient estimation of the number of clusters", Int. Conf. on Machine Learning (ICML'00), Stanford, CA, USA, June 2000.
P. Fränti, T. Kaukoranta, D.-F. Shen and K.-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000.
A. Likas, N. Vlassis and J.J. Verbeek, "The global k-means clustering algorithm", Pattern Recognition, 36, 451-461, 2003.
P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.
Processing time
Clustering quality
Conclusions

What did we learn?
1. Random swap is an efficient algorithm.
2. It does not converge to a sub-optimal result.
3. Expected processing time has these dependencies:
• Linear O(N) on the size of the data
• Quadratic O(k²) on the number of clusters
• Inverse O(1/α²) on the neighborhood size
• Logarithmic O(log w) on the number of swaps
References
• P. Fränti, "Efficiency of random swap clustering", Journal of Big Data, 5:13, 1-29, 2018.
• P. Fränti and J. Kivijärvi, "Randomised local search algorithm for the clustering problem", Pattern Analysis and Applications, 3 (4), 358-369, 2000.
• P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31 (8), 1139-1148, August 1998.
• P. Fränti, O. Virmajoki and V. Hautamäki, "Efficiency of random swap based clustering", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, FL, Dec 2008.
• Pseudo code: http://cs.uef.fi/pages/franti/research/rs.txt
Supporting material

Implementations available (C, Matlab, Java, Javascript, R and Python):
http://www.uef.fi/web/machine-learning/software

Interactive animation: http://cs.uef.fi/sipu/animator/

Clusterator: http://cs.uef.fi/sipu/clusterator