Clustering with Diversity
July 2010
Jian Li
University of Maryland, College Park
Ke Yi and Qin Zhang
Hong Kong University of Science & Technology
ICALP 2010
Clustering - overview
A general clustering problem:
Given a set V of points in a metric space, partition V into a set of disjoint clusters such that a certain objective function is minimized, subject to some
• cluster-level constraints: impose restrictions, for example, on the number of clusters (exactly k clusters) or on the size of each cluster (at least ℓ elements).
• instance-level constraints: specify whether a particular pair of items can be clustered together, based on some background knowledge.
Clustering with diversity
ℓ-diversity: Given a set V of points in a metric space, each of which has a color, partition V into clusters subject to:
1. each cluster has size ≥ ℓ (⇒ ℓ-anonymity);
2. all points partitioned into one cluster must have distinct colors.
Goal (this paper): minimize max{diameter of clusters}.
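To make the problem statement concrete, here is a minimal brute-force sketch in Python. It enumerates every partition of a tiny point set and keeps the ℓ-diverse clustering with the smallest maximum diameter. It is exponential and only illustrates the definition; it is not the algorithm from the talk. The function names, the Euclidean metric, and the sample points are assumptions made for this example.

```python
from itertools import combinations
from math import dist, inf


def partitions(items):
    """Yield every partition of `items` (a list) into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def diameter(cluster, points):
    """Largest pairwise distance inside a cluster (0 for a single point)."""
    pairs = combinations(cluster, 2)
    return max((dist(points[u], points[v]) for u, v in pairs), default=0.0)


def is_l_diverse(cluster, colors, l):
    """A cluster is allowed if it has at least l points, all of distinct colors."""
    cols = [colors[v] for v in cluster]
    return len(cluster) >= l and len(set(cols)) == len(cols)


def brute_force_l_diversity(points, colors, l):
    """Return (best max-diameter, clustering), or (inf, None) if infeasible."""
    best, best_part = inf, None
    for part in partitions(list(range(len(points)))):
        if all(is_l_diverse(c, colors, l) for c in part):
            cost = max((diameter(c, points) for c in part), default=0.0)
            if cost < best:
                best, best_part = cost, part
    return best, best_part


# Hypothetical toy instance: two red and two blue points on a line.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
colors = ["red", "blue", "red", "blue"]
cost, clustering = brute_force_l_diversity(points, colors, l=2)
print(cost, clustering)
```

On this toy instance the only allowed clusters are red-blue pairs, so the search compares the pairing of neighbours against the pairing of far-apart points.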
The main motivation
ℓ-diversity is motivated by privacy preservation in data publication (Machanavajjhala et al. 2006) and follows ℓ-anonymity (a.k.a. k-anonymity, Samarati 2001).

(a) The microdata. Age, Gender and Degree form the quasi-identifier (QI); Disease is the sensitive attribute.

T-ID (Name)   Age   Gender   Degree   Disease
1 (Adam)      29    M        M.Sc.    pneumonia
2 (Bob)       25    M        M.Sc.    pneumonia
3 (Calvin)    25    M        B.Sc.    pneumonia
4 (Daisy)     29    F        B.Sc.    pneumonia
5 (Elam)      40    F        B.Sc.    HIV
6 (Frank)     45    F        B.Sc.    HIV
7 (George)    35    M        B.Sc.    bronchitis
8 (Henry)     37    M        B.Sc.    bronchitis
9 (Ivy)       50    M        Ph.D.    bronchitis
10 (Jane)     60    M        Ph.D.    dyspepsia

(b) A 2-anonymous table. The Disease column is the same as in (a); the QIs are generalized to:
(25-29, M, MSc)
(25-29, *, BSc)
(40-45, M, BSc)
(35-37, M, BSc)
(50-60, F, PhD)

(c) A 2-diverse table. The Disease column is the same as in (a); the QIs are generalized to:
(25-29, M, *)
(25-29, *, *)
(50-60, F, PhD)
(35-40, M, BSc)
(37-45, M, BSc)
Previous work
ℓ-anonymity
A 2-approximation in the metric space is known (Aggarwal et al. 2006).
ℓ-diversity
Many heuristic solutions have been proposed in the DB community (e.g. LeFevre et al. 2006, Ghinita et al. 2007), but there is no theoretical result.
Outliers (remove some points to get better clusters)
First considered by Charikar et al. 2001 for facility location and k-median.
A 4-approximation is known for ℓ-anonymity (Aggarwal et al. 2006).
Related work on instance-level constraints
ML constraints and CL constraints (Wagstaff and Cardie 2000)
• must-link (ML): two points must be clustered together.
• cannot-link (CL): two points must be separated.
ℓ-diverse clustering can be seen as a special case in which every pair of nodes with the same color is subject to a CL constraint.
No approximation algorithm has been studied.
Correlation clustering (Bansal et al. 2004)
Minimize the violations of the given constraints.
The best approximation algorithms are due to Ailon et al. 2008.
Our results for ℓ-diversity
• A 2-approximation algorithm (if the problem has feasible solutions).
• A matching lower bound assuming P ≠ NP (even when there are only 3 colors).
• An O(1)-approximation algorithm for the infeasible case (if the problem does not have a feasible solution, we remove the least possible number of points to get a feasible solution).
2-approximation algorithm
Problems and definitions
Recall ℓ-diversity: Given a set V of points in a metric space, each of which has a color, partition V into clusters subject to:
1. each cluster has size ≥ ℓ;
2. all points partitioned into one cluster must have distinct colors.
Goal: minimize the largest diameter of the clusters.
Given V, construct G(V, E) with
• each v ∈ V has a color c(v);
• for each pair of points u, v with distinct colors, create an edge e = (u, v) with weight w(e), the distance between u and v in the metric space;
• diameter of C ⊆ V: max_{u,v ∈ C} w(e(u, v)).
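The graph construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: points are coordinate tuples, the metric is Euclidean, and the names `build_graph` and `cluster_diameter` are made up for this example.

```python
from itertools import combinations
from math import dist


def build_graph(points, colors):
    """Return {(u, v): weight} with u < v, only for pairs of distinct colors."""
    edges = {}
    for u, v in combinations(range(len(points)), 2):
        if colors[u] != colors[v]:
            edges[(u, v)] = dist(points[u], points[v])
    return edges


def cluster_diameter(cluster, edges):
    """Max edge weight inside `cluster`.

    Returns None if some pair has no edge, i.e. two points share a color,
    so the cluster could never be part of a valid solution.
    """
    worst = 0.0
    for u, v in combinations(sorted(cluster), 2):
        if (u, v) not in edges:
            return None
        worst = max(worst, edges[(u, v)])
    return worst


# Hypothetical toy instance: points 0 and 2 share a color, so G has no (0, 2) edge.
points = [(0, 0), (3, 4), (6, 8)]
colors = ["red", "blue", "red"]
G = build_graph(points, colors)
print(G)
print(cluster_diameter([0, 1], G))
print(cluster_diameter([0, 2], G))
```

Leaving out same-color edges means that any cluster in which every pair is joined by an edge automatically satisfies the distinct-colors constraint.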
Definitions (cont.)
star forest: a forest where each connected component is a star.
spanning star forest: a star forest with no isolated point.
semi-valid spanning star forest: a spanning star forest with each component containing at least ℓ colors.
valid spanning star forest: a semi-valid spanning star forest with each component containing points with distinct colors.
(The slide figures illustrate these definitions with ℓ = 3.)
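The nested definitions can be checked mechanically. The following Python sketch classifies a given star forest; the representation (a dict from each star center to its list of leaves) and the function name are assumptions for illustration, and the sketch does not verify that the star edges actually exist in G.

```python
def classify_star_forest(stars, colors, l):
    """Classify a star forest over nodes 0..n-1.

    stars:  dict mapping each star center to the list of its leaves
    colors: colors[v] is the color of node v
    Returns "not spanning", "spanning", "semi-valid" or "valid".
    """
    n = len(colors)
    members = [[center] + list(leaves) for center, leaves in stars.items()]
    covered = sorted(v for star in members for v in star)
    # Spanning: every node appears exactly once and no star is an isolated point.
    if covered != list(range(n)) or any(len(star) < 2 for star in members):
        return "not spanning"
    star_colors = [[colors[v] for v in star] for star in members]
    # Semi-valid: each star contains at least l different colors.
    if any(len(set(cols)) < l for cols in star_colors):
        return "spanning"
    # Valid: additionally, no color repeats inside a star.
    if any(len(set(cols)) != len(cols) for cols in star_colors):
        return "semi-valid"
    return "valid"


# Hypothetical examples with l = 3.
six = ["r", "g", "b", "r", "g", "b"]
seven = six + ["r"]
print(classify_star_forest({0: [1, 2], 3: [4, 5]}, six, 3))       # all colors distinct
print(classify_star_forest({0: [1, 2], 3: [4, 5, 6]}, seven, 3))  # one star repeats "r"
print(classify_star_forest({0: [1], 2: [3, 4, 5]}, six, 3))       # one star has 2 colors
```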
A review of the 2-approximation for ℓ-anonymity
General idea of the algorithm in Aggarwal et al. 2006:
1. Let e_1, e_2, ... be the edges of G in non-decreasing order of weight.
2. Consider each graph G_i formed by the first i edges, E_i = {e_1, e_2, ..., e_i}.
3. For each G_i = (V, E_i), try to find a maximal independent set (IS) I such that
   (1) there is a spanning star forest in G_i with the nodes in I being the star centers,
   (2) each star has at least ℓ nodes.
Let the diameter of the OPT clustering be d* with w(e_{i*}) = d*. Aggarwal et al. show that the trial on G_{i*} must succeed.
⇒ 2-approximation (SOL ≤ 2d*): every edge used has weight at most d*, so by the triangle inequality any two nodes of a star are within distance 2d*.
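The outer loop of this framework (steps 1 and 2, plus picking some maximal independent set) can be sketched as follows. This is a partial illustration only: it uses a simple greedy rule to build the maximal IS and does not implement the star-forest test of step 3, which the slides do not spell out. All function names and the toy instance are assumptions.

```python
from itertools import combinations


def threshold_graphs(n, edges):
    """Yield (i, weight of e_i, adjacency of G_i) for i = 1, 2, ...

    edges: dict {(u, v): weight}. The same adjacency dict is extended in
    place at every step, so use it before asking for the next graph.
    """
    adj = {v: set() for v in range(n)}
    ordered = sorted(edges.items(), key=lambda kv: kv[1])
    for i, ((u, v), w) in enumerate(ordered, start=1):
        adj[u].add(v)
        adj[v].add(u)
        yield i, w, adj


def greedy_maximal_independent_set(adj):
    """Scan nodes in increasing order; keep a node unless a neighbor was kept."""
    chosen, blocked = [], set()
    for v in sorted(adj):
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked |= adj[v]
    return chosen


# Hypothetical toy instance for l-anonymity: four points on a line, complete graph.
positions = [0, 1, 5, 6]
edges = {(u, v): abs(positions[u] - positions[v])
         for u, v in combinations(range(4), 2)}
trace = [(i, w, greedy_maximal_independent_set(adj))
         for i, w, adj in threshold_graphs(4, edges)]
for step in trace:
    print(step)
```

In this toy run, G_2 already contains the two short edges, and the greedy independent set {0, 2} gives one center per pair of neighbouring points.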
Our algorithm for ℓ-diversity
Our algorithm for ℓ-diversity follows the same framework, but we try to find a maximal IS I such that there is a valid spanning star forest in G_i with the nodes in I being the star centers.
We can also prove that the algorithm will succeed on G_{i*} ⇒ 2-approximation.
The additional challenge: the nodes in a star must have distinct colors.
Two tests
S-TEST(G_i, I): check if there exists a semi-valid spanning star forest in G_i with nodes in I being star centers.
V-TEST(G_i, I): check if there exists a valid spanning star forest in G_i with nodes in I being star centers.
[Figure: an example graph after S-TEST and after V-TEST]
The 2-approximate algorithm
We perform the following trial on the graphs G_1, G_2, ... one by one, until on some G_i we find a valid spanning star forest with a maximal independent set as centers.
1. Let I be an arbitrary maximal IS in G_i.
2. While S-TEST(G_i, I) is passed:
   (a) (S, S′) ← V-TEST(G_i, I)   /* S ⊆ V − I, S′ ⊆ I */
   (b) If S = ∅ then Succeed; else
       i. I ← I − S′ + S   /* |S′| < |S|, so |I| increases and I is still an IS */
       ii. Add nodes to I until it is a maximal IS.
       /* steps i and ii redistribute points and pick new centers */
3. Fail.
Claim: Both tests succeed on G_{i*} ⇒ there exists a valid spanning star forest on G_{i*}.
Lower bound
Theorem: There is no polynomial-time approximation algorithm for ℓ-diversity that achieves an approximation factor less than 2 unless P = NP.
By reduction from 3D-matching.
Holds even when there are only 3 colors.
If there are 2 colors, the problem can be solved in polynomial time by finding perfect matchings in G_1, G_2, ....
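The two-color remark can be illustrated with a short Python sketch. With two colors and ℓ = 2, every cluster must be a pair of differently colored points, so a feasible clustering on G_i is exactly a perfect matching between the two color classes. The sketch below tries the thresholds in increasing order and uses a standard augmenting-path bipartite matching; the function names, the Euclidean metric, and the restriction to ℓ = 2 are assumptions made for this example.

```python
from math import dist


def has_perfect_matching(reds, blues, adj):
    """Augmenting-path bipartite matching; adj[r] lists the blues adjacent to r."""
    if len(reds) != len(blues):
        return False
    match = {}  # blue node -> red node currently matched to it

    def try_augment(r, seen):
        for b in adj[r]:
            if b in seen:
                continue
            seen.add(b)
            # Take b if it is free, or if its current partner can move elsewhere.
            if b not in match or try_augment(match[b], seen):
                match[b] = r
                return True
        return False

    return all(try_augment(r, set()) for r in reds)


def two_color_optimum(points, colors):
    """Smallest threshold whose graph has a perfect matching, or None.

    Assumes exactly two colors and l = 2.
    """
    reds = [v for v, c in enumerate(colors) if c == colors[0]]
    blues = [v for v, c in enumerate(colors) if c != colors[0]]
    weights = sorted({dist(points[r], points[b]) for r in reds for b in blues})
    for w in weights:  # the threshold graph keeps edges of weight <= w
        adj = {r: [b for b in blues if dist(points[r], points[b]) <= w]
               for r in reds}
        if has_perfect_matching(reds, blues, adj):
            return w
    return None


# Hypothetical toy instance: four points on a line.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
print(two_color_optimum(points, ["red", "blue", "red", "blue"]))
print(two_color_optimum(points, ["red", "red", "blue", "blue"]))
print(two_color_optimum(points, ["red", "red", "red", "blue"]))
```

A binary search over the sorted weights would also work; the linear scan is kept here because it mirrors the G_1, G_2, ... order on the slide.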
The infeasible case (high-level sketch)
If some color has more than ⌊n/ℓ⌋ points, there is no feasible solution, so we must remove some points.
The least number of points that must be removed to get a feasible solution can be computed; call it k.
Goal: compute an optimal clustering after deleting k points.
Additional challenge: even knowing k, we have to decide which points should be deleted.
Our strategy: find a valid star forest spanning n − k nodes in G_{i*}^{28}.
⇒ 56-approximation
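The infeasibility condition in the first line follows from counting: every cluster has at least ℓ points, so there are at most ⌊n/ℓ⌋ clusters, and each cluster can hold at most one point of any given color. A small Python check of that condition is sketched below; the function name is an assumption, and the sketch does not compute k.

```python
from collections import Counter


def violates_color_bound(colors, l):
    """True if some color occurs more than floor(n / l) times."""
    n = len(colors)
    return max(Counter(colors).values(), default=0) > n // l


# Disease column of the microdata table used as colors (n = 10).
diseases = (["pneumonia"] * 4 + ["HIV"] * 2 +
            ["bronchitis"] * 3 + ["dyspepsia"])
print(violates_color_bound(diseases, 2))  # 4 pneumonia rows vs. floor(10/2) = 5
print(violates_color_bound(diseases, 3))  # 4 pneumonia rows vs. floor(10/3) = 3
```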
Further work
• Try to minimize the sum of the diameters of the clusters.
• Design approximation algorithms for the problem when any fixed number of points may be removed.
• Consider other instance-level constraints, such as general Must-Link and Cannot-Link constraints.
The End
THANK YOU
Q and A