Cluster Description and Related Problems
DDM Lab Seminar
2005 Spring
Byron Gao
Road Map
1. Motivations
2. Cluster Description Problems
3. Related Problems in Machine Learning
4. Related Problems in Theory
5. Related Problems in Computational Geometry
6. Conclusions and Past / Future Work
1. Motivations
Clustering
Group objects into clusters such that objects within a cluster are similar and objects from different clusters are dissimilar
– One of the major data mining tasks
– K-means, hierarchical, density-based…
Cluster Descriptions
The literature lacks a systematic study of cluster descriptions
– Most clustering algorithms just give membership assignments
– Hidden knowledge in data needs to be represented for inference
– Integration of database and data mining
Purposes of cluster descriptions
– Summarize data to gain initial knowledge
– Compress data to support further investigation
One way of describing a cluster: DNF formula
– A set of isothetic (axis-parallel) hyper-rectangles
Interpretable
Used as a search condition to retrieve objects
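A minimal sketch of the DNF format, assuming objects are numeric tuples and each conjunctive term is an axis-parallel box given by one closed interval per dimension. The names `in_box` and `in_description` and the sample data are illustrative assumptions, not taken from the seminar.

```python
from typing import List, Sequence, Tuple

# One conjunctive term: a closed interval (low, high) for each dimension.
Box = List[Tuple[float, float]]


def in_box(obj: Sequence[float], box: Box) -> bool:
    """True if obj satisfies every interval condition of the box."""
    return all(low <= x <= high for x, (low, high) in zip(obj, box))


def in_description(obj: Sequence[float], desc: List[Box]) -> bool:
    """DNF semantics: obj is described if at least one box contains it."""
    return any(in_box(obj, box) for box in desc)


# Hypothetical 2-d description made of two rectangles.
desc = [[(0, 2), (0, 2)], [(5, 7), (1, 4)]]
objects = [(1, 1), (6, 3), (3, 3)]
described = [o for o in objects if in_description(o, desc)]
```

The description length in this representation is simply the number of boxes, `len(desc)`.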
Summarize Data
Gain an initial and basic understanding of the clusters
– Capture shape and location in the multi-dimensional space
Requirement: interpretability
– Simple and clear structure of description format
e.g., DNF, B-DNF…
B-DNF: set difference between the bounding box B and a DNF formula describing the objects in B but not in cluster C
– Short in length
Another requirement: accuracy
– Interpreting the correct target
– May trade accuracy for shorter length
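The B-DNF format can be sketched as a membership test: an object is described if it lies in the bounding box B but in none of the boxes that describe B – C. This is an illustration under the assumption that boxes are lists of closed per-dimension intervals; `in_bdnf` and the L-shaped example are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = List[Tuple[float, float]]  # one closed interval (low, high) per dimension


def in_box(obj: Sequence[float], box: Box) -> bool:
    """True if obj lies inside the axis-parallel box."""
    return all(low <= x <= high for x, (low, high) in zip(obj, box))


def in_bdnf(obj: Sequence[float], bounding_box: Box, holes: List[Box]) -> bool:
    """B-DNF semantics: inside B, but in none of the boxes describing B - C."""
    return in_box(obj, bounding_box) and not any(in_box(obj, h) for h in holes)


# Hypothetical L-shaped cluster: the bounding box minus its upper-right corner.
B = [(0, 4), (0, 4)]
holes = [[(2.5, 4), (2.5, 4)]]
```

For a cluster that nearly fills its bounding box, one hole can be shorter than the several boxes a plain DNF description would need.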
Compress Data (1)
Data compression has important applications in data transmission and data storage
– Transmission: distribute clustering results to branches
Clustering performed interactively by an expert at head office
– Storage: support further investigation
Partial results or results (in some sense, always partial results)
Description process ≈ encoding (compression)
Retrieval process ≈ decoding (decompression)
– Description as the search condition in a SELECT query statement
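To illustrate retrieval as decoding, the sketch below renders a DNF description as the WHERE condition of a SELECT statement and runs it against an in-memory SQLite table. The table name `objects`, the columns `x` and `y`, and the helper `where_clause` are assumptions made for this example only.

```python
import sqlite3


def where_clause(desc, columns):
    """Render a DNF description as SQL: an OR of per-box AND terms."""
    terms = []
    for box in desc:
        # Bounds are numeric literals from the description, not user input.
        conds = [f"{col} BETWEEN {lo} AND {hi}" for col, (lo, hi) in zip(columns, box)]
        terms.append("(" + " AND ".join(conds) + ")")
    return " OR ".join(terms)


# Hypothetical description with two rectangles over columns x and y.
desc = [[(0, 2), (0, 2)], [(5, 7), (1, 4)]]
condition = where_clause(desc, ["x", "y"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (x REAL, y REAL)")
conn.executemany("INSERT INTO objects VALUES (?, ?)", [(1, 1), (6, 3), (3, 3), (9, 9)])
# "Decoding": the description alone is enough to retrieve the cluster members.
rows = conn.execute(f"SELECT x, y FROM objects WHERE {condition} ORDER BY x").fetchall()
conn.close()
```

Only the two rectangles need to be stored or transmitted; the member objects are recovered by the query.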
Compress Data (2)
Requirement: compression ratio and retrieval efficiency
– Short in length
comp. ratio = |C| / (|DESC| * 2)
Typically < 0.01
Another requirement: accuracy
– Retrieve desired objects
– May trade accuracy for shorter length: lossy compression / faster retrieval
Compress Data (3)
Support further investigation
– Exploratory queries without retrieval of original objects
e.g.,…
Incremental clustering
– Exploratory queries requiring retrieval of original objects
Information not associated with the description
Information not revealed by the description
Interactive and iterative mining environment
– Inductive database framework
2. Cluster Description Problems
SDL Problem
Shortest Description Length (SDL) problem
Given a cluster C, its bounding box B, and a description format, obtain a logical expression DESC in the given format with minimum length such that for any object o, (o ∈ C ⇒ o ∈ DESC) ∧ (o ∈ B – C ⇒ o ∉ DESC)
Lossless
NP-hard
– variant of the Minimum Set Cover problem
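The lossless condition of the SDL problem can be checked directly. The sketch below assumes a DNF description given as a list of boxes and takes the objects of C and of B – C as two separate lists; `is_valid_sdl_description` and the sample points are hypothetical.

```python
def in_box(obj, box):
    """True if obj lies inside the axis-parallel box (closed intervals)."""
    return all(low <= x <= high for x, (low, high) in zip(obj, box))


def in_description(obj, desc):
    """DNF semantics: covered by at least one box."""
    return any(in_box(obj, box) for box in desc)


def is_valid_sdl_description(desc, cluster, others_in_B):
    """Lossless condition: every object of C is covered, no object of B - C is."""
    covers_all = all(in_description(o, desc) for o in cluster)
    excludes_rest = not any(in_description(o, desc) for o in others_in_B)
    return covers_all and excludes_rest


cluster = [(1, 1), (2, 1), (1, 3)]   # objects in C
others_in_B = [(2, 3)]               # objects in B - C

two_boxes = [[(1, 2), (1, 1)], [(1, 1), (1, 3)]]  # length 2, lossless
one_box = [[(1, 2), (1, 3)]]                      # length 1, but covers (2, 3)
```

Here the single bounding box is shorter but violates the second half of the condition, so the shortest valid description has length 2.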
MDQ Problem
Maximum Description Quality (MDQ) problem
Given a cluster C, its bounding box B, an integer l, a description format, and a quality measure, obtain a logical expression DESC in the given format with length at most l such that the quality measure is maximized
Lossy
NP-hard
– reducible to the Set Cover problem
MDQ Problem: Quality Measure
F-measure F = 2PR / (P + R)
– Harmonic mean of P and R
– R = |C_described| / |C|
– P = |C_described| / |B_described|
where C_described (B_described) is the set of objects in C (B) that are described by the description
R_{P=1} (Recall at fixed Precision of 1)
P_{R=1} (Precision at fixed Recall of 1)
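A small worked example of the quality measures, assuming a DNF description as a list of boxes and the objects of B split into cluster members and non-members. The function name `description_quality` and the data are illustrative.

```python
def in_box(obj, box):
    """True if obj lies inside the axis-parallel box (closed intervals)."""
    return all(low <= x <= high for x, (low, high) in zip(obj, box))


def in_description(obj, desc):
    """DNF semantics: covered by at least one box."""
    return any(in_box(obj, box) for box in desc)


def description_quality(desc, cluster, others_in_B):
    """Return (precision, recall, F-measure) of a description of the cluster."""
    c_described = sum(in_description(o, desc) for o in cluster)
    # B contains C, so the described objects of B include those of C.
    b_described = c_described + sum(in_description(o, desc) for o in others_in_B)
    recall = c_described / len(cluster)
    precision = c_described / b_described if b_described else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


cluster = [(1, 1), (2, 1), (1, 3), (4, 4)]
others_in_B = [(2, 3)]
desc = [[(1, 2), (1, 3)]]  # covers 3 of the 4 cluster objects plus (2, 3)
precision, recall, f_measure = description_quality(desc, cluster, others_in_B)
```

In this example three of four cluster objects are described (R = 0.75) and three of the four described objects belong to the cluster (P = 0.75), so F = 0.75.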
3. Related Problems in Machine Learning
Concept Learning
Concept learning: find a classifier for a concept
– Given a labeled training sequence, generate a hypothesis that is a good approximation of the target concept and can be used as a predictor for other query points
Typical machine learning problem
– Most extensively studied problem in computational learning theory
Representation of target concept assumed known
– DNF formulas, Boolean circuits, geometric objects, pattern languages, deterministic finite automata, etc.
Geometric Concept Learning
Target is a geometric object
– Half spaces: separated by linear hyperplanes
Perceptron learning
– Intersections of half spaces: convex polyhedra, in particular axis-parallel boxes
– More accurate yet simple enough to offer intuitive solutions
– Unions of axis-parallel boxes, e.g., 2 boxes; less studied
Popular topic in computational learning theory
Learning Models
PAC learning model: Probably Approximately Correct [15]
– Given a reasonable number of labeled instances, generate with high probability a hypothesis that is a good approximation of the target concept within a reasonable amount of time
Exact Learning Model [2]
– The learner is allowed to pose queries in order to exactly identify the target concept within a reasonable amount of time
– Common queries, answered by an oracle
Membership query: feed an instance to the oracle, which returns its classification
Equivalence query: present a hypothesis to the oracle, which returns “correct” or a counterexample
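The query protocol of the exact learning model can be sketched on a toy problem: learning a single axis-parallel box over a small finite grid using only equivalence queries. The `Oracle` class, the `learn_box` learner and the grid are assumptions for illustration; this is a simple textbook-style strategy, not an algorithm from the cited work.

```python
from itertools import product


class Oracle:
    """Answers queries about a hidden target concept over a finite instance space."""

    def __init__(self, instance_space, target):
        self.instance_space = list(instance_space)
        self.target = set(target)

    def membership(self, instance):
        """Membership query: return the classification of one instance."""
        return instance in self.target

    def equivalence(self, hypothesis):
        """Equivalence query: None means correct, otherwise a counterexample."""
        for x in self.instance_space:
            if hypothesis(x) != (x in self.target):
                return x
        return None


def learn_box(oracle, dim):
    """Keep the bounding box of all counterexamples seen so far.

    The hypothesis is always contained in the target box, so every
    counterexample is a positive instance that enlarges the hypothesis.
    """
    positives = []

    def box():
        return [(min(p[i] for p in positives), max(p[i] for p in positives))
                for i in range(dim)]

    def hypothesis(x):
        return bool(positives) and all(lo <= v <= hi for v, (lo, hi) in zip(x, box()))

    while True:
        counterexample = oracle.equivalence(hypothesis)
        if counterexample is None:
            return box() if positives else None
        positives.append(counterexample)


grid = list(product(range(6), repeat=2))
target = {(x, y) for x, y in grid if 1 <= x <= 3 and 2 <= y <= 4}
oracle = Oracle(grid, target)
learned = learn_box(oracle, dim=2)
```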
Similarities
Learn unions of axis-parallel boxes
– DNF formula
Learner is fed with positive and negative examples, which need to be classified accurately
Dissimilarities (1): objective
Geometric concept learning
– Exact or approximate identification of a geometric object
Related to geometric object construction in computational geometry
– Ultimate goal is the classification accuracy over the instance space
Labeled instances are a small part of the instance space
– Optimization: least disagreement problem
Fixed number of boxes, usually 1 or 2
“+” and “–” examples are treated equally
Cluster description problems
– Entire instance space is given
– To describe “+” examples instead of classifying “+” and “–” examples
– Optimization problem: SDL or MDQ problems
For the quality measure, “+” and “–” examples are not treated equally
Examples could be labeled “1”, “5”, “9”... for clustering description
Dissimilarities (2): research focus
Geometric concept learning
– Learnability
Under certain learning models
– Learning from queries (membership, equivalence) rather than from random examples to significantly improve performance
– Low-dimensional
Cluster description problems
– Heuristics
– Large data sets, possibly high-dimensional
– Other concerns: compression tool, retrieval efficiency, etc.
4. Related Problems in Theory
Two Traditional Problems
Minimum Set Cover problem [10]
– Given a collection S of subsets of a finite ground set X, find S’ ⊆ S with minimum cardinality such that ∪_{Si ∈ S’} Si = X
– Approximable within ln|X| [8]: greedy set cover
Iteratively picks the subset that covers the maximum number of uncovered elements
Maximum Coverage problem (max k-cover): a variant problem
– Maximize the number of covered elements in X using at most k subsets from S
– Approximable within e / (e – 1) by the greedy algorithm [8]
The ratios are optimal [6]
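Both greedy algorithms can be sketched in a few lines when the collection S is given explicitly as a list of Python sets. The function names and the small instance are illustrative assumptions.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the subsets do not cover the ground set")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


def greedy_max_coverage(universe, subsets, k):
    """Same greedy rule, stopped after at most k subsets (max k-cover)."""
    uncovered = set(universe)
    chosen = []
    for _ in range(k):
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            break  # nothing more can be covered
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen, len(set(universe)) - len(uncovered)


X = {1, 2, 3, 4, 5, 6}
S = [{1, 2, 3}, {4, 5}, {3, 4}, {5, 6}, {1, 6}]
cover = greedy_set_cover(X, S)
picked, covered_count = greedy_max_coverage(X, S, k=2)
```

Both functions return indices into `S`; ties are broken in favor of the subset that appears first.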
Red Blue Set Cover Problem
Given a finite set of “red” elements R, a finite set of “blue” elements B and a family S ⊆ 2^(R ∪ B), find a subfamily S’ ⊆ S which covers all blue elements and a minimum number of red elements
Raised recently (2000) [4]
Related to the group Steiner tree, minimum monotone satisfying assignment and minimum color path problems
The set cover problem reduces to it, so it is at least as hard as set cover [4]
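To make the objective concrete, here is an exhaustive-search sketch for tiny instances: it tries every subfamily, keeps those covering all blue elements, and returns one that covers the fewest red elements. The function name and the instance are hypothetical, and the search is exponential in the size of the family.

```python
from itertools import combinations


def red_blue_set_cover(red, blue, family):
    """Return (red elements covered, chosen indices) of a best subfamily, or None."""
    best = None
    for r in range(1, len(family) + 1):
        for idxs in combinations(range(len(family)), r):
            covered = set().union(*(family[i] for i in idxs))
            if blue <= covered:                # all blue elements are covered
                cost = len(covered & red)      # number of red elements covered
                if best is None or cost < best[0]:
                    best = (cost, idxs)
    return best


red = {"r1", "r2", "r3"}
blue = {"b1", "b2", "b3"}
family = [
    {"b1", "b2", "b3", "r1", "r2"},
    {"b1", "b2", "r3"},
    {"b3"},
    {"b1", "r1"},
]
best = red_blue_set_cover(red, blue, family)
```

Note that the cheapest solution here uses two sets rather than the single set that covers all blue elements, because the objective counts red elements, not sets.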
SDL Problem is NP-hard
A variant of the Minimum Set Cover problem
– Ground set X = set of objects in C, the cluster being described
– Collection S of rectangles (subsets):
Each Sj ∈ S is a rectangle minimally covering some objects in C and no objects outside C
S ⊆ 2^C
– Find S’ ⊆ S with minimum cardinality such that ∪_{Si ∈ S’} Si = C
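The collection S in this mapping is implicit: it has to be derived from the data. As an illustration only (not the author's algorithm), the sketch below enumerates it by brute force for a tiny cluster, taking the minimum bounding box of every non-empty subset of C and keeping the boxes that contain no object outside C. The function names and data are assumptions.

```python
from itertools import combinations


def bounding_box(points):
    """Minimum axis-parallel box containing the given points."""
    dims = range(len(points[0]))
    return tuple((min(p[d] for p in points), max(p[d] for p in points)) for d in dims)


def in_box(obj, box):
    """True if obj lies inside the axis-parallel box (closed intervals)."""
    return all(low <= x <= high for x, (low, high) in zip(obj, box))


def candidate_rectangles(cluster, others):
    """Distinct minimal boxes over subsets of C that exclude all non-members."""
    boxes = set()
    for r in range(1, len(cluster) + 1):
        for subset in combinations(cluster, r):
            box = bounding_box(subset)
            if not any(in_box(o, box) for o in others):
                boxes.add(box)
    return boxes


cluster = [(1, 1), (2, 1), (1, 3)]
others = [(2, 3)]
candidates = candidate_rectangles(cluster, others)
```

The enumeration visits every subset of C, so it is exponential in |C| and only usable on toy inputs.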
MDQ Problem is NP-hard
R_{P=1}: a variant of the Maximum Coverage problem
– In the same fashion as shown on the previous slide
– Given k, find S’ ⊆ S with |S’| at most k such that the number of covered elements in C is maximized
P_{R=1}: a variant of the Red Blue Set Cover problem
F-measure: reducible to either of the two
Similarities and Dissimilarities
The Set Cover problem and its variants arise in a variety of settings
Useful for arguing the NP-hardness of the SDL and MDQ problems
Not useful from an algorithm design point of view because of an essential difference: in the SDL and MDQ problems, the collection of subsets S is not explicitly given
– e.g., the greedy set cover algorithm cannot be applied
5. Related Problems in Computational Geometry
Maximum Box Problem
Given two finite sets of points X⁺ and X⁻ in ℝ^d, find an axis-parallel box B such that B ∩ X⁻ = ∅ and the total number of points from X⁺ covered is maximized
Raised recently (2002) [5]
NP-hard when d is part of the input
Studied by [12] [14] for d = 2, finding exact solutions
Corresponds to the MDQ problem with k = 1 and R_{P=1}
Techniques for low-dimensional and small datasets
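A brute-force sketch of the maximum box problem for small inputs. It relies on the observation that an optimal box can be shrunk to the bounding box of the positive points it covers, so only box boundaries taken from positive coordinates need to be tried. This is an illustration under that assumption, not one of the exact methods of [12] or [14]; `maximum_box` and the data are hypothetical.

```python
from itertools import combinations_with_replacement, product


def in_box(obj, box):
    """True if obj lies inside the axis-parallel box (closed intervals)."""
    return all(low <= x <= high for x, (low, high) in zip(obj, box))


def maximum_box(positives, negatives):
    """Return (box, count): a negative-free box covering the most positives."""
    dim = len(positives[0])
    intervals_per_dim = []
    for d in range(dim):
        coords = sorted({p[d] for p in positives})
        # All (low, high) pairs with low <= high taken from positive coordinates.
        intervals_per_dim.append(list(combinations_with_replacement(coords, 2)))
    best_box, best_count = None, 0
    for box in product(*intervals_per_dim):
        if any(in_box(n, box) for n in negatives):
            continue  # the box must not contain any negative point
        count = sum(in_box(p, box) for p in positives)
        if count > best_count:
            best_box, best_count = box, count
    return best_box, best_count


positives = [(1, 1), (2, 2), (3, 1), (5, 5)]
negatives = [(4, 3)]
box, count = maximum_box(positives, negatives)
```

The number of candidate boxes grows roughly as n^(2d) for n positive points, which is consistent with such exhaustive techniques being limited to low-dimensional, small datasets.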
Rectilinear Polygon Covering Problem
Find a collection of axis-parallel rectangles with minimum cardinality whose union is exactly a given rectilinear polygon
– Covering rectilinear polygons with holes by axis-parallel rectangles [7] (1979)
– Different from partition: overlapping is allowed
– 2-d
NP-complete [13]
No polynomial time approximation scheme unless P = NP [3]
A special case of the general set covering problem
– Performance guarantee O(log n)
– Recent result O(√(log n)) [9]
Similarities and Dissimilarities
If extended to higher-dimensional space, exactly the same formulation as in Greedy Growth [1]
If “don’t care” cells are further allowed, a formulation similar to that of Algorithm BP [11]
If not restricted to grid data, the same as the SDL problem
[1] and [11] are related studies in the DB literature
Logic minimization is also related to the grid case
– Most research is on Boolean variables
6. Conclusions and Past / Future Work
Conclusions
Cluster (clustering) description is of fundamental importance to KDD, but has not been systematically studied
The formulated problems differ significantly from related problems in the machine learning, theory, computational geometry and DB literature
Past Work
Formulated and provided heuristic algorithms for the SDL and MDQ problems
– For both DNF and B-DNF formats
– Work for both vector and grid data
Studied the efficient retrieval issue
– Cost model
– Efficiency estimation methodology
– Optimal ordering of description terms
Future Work
Optimization of algorithms: work so far has focused on introducing concepts and ideas
Real data experiments
Experimental validation of the retrieval efficiency issue: DBMS
Description alphabet: always rectangles?
– If not, getting farther away from some related problems
Clustering description: gaining quality and efficiency
– Even more distinguishable from related problems
Incremental description: the description process could take long
Interaction with clustering algorithms: online description?
References:
1. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD 1998.
2. D. Angluin. Queries and concept learning. Machine Learning, 2(4): 319-342, 1988.
3. P. Berman and B. Dasgupta. Approximating rectilinear polygon cover problems. Algorithmica, 17: 331-356, 1997.
4. R.D. Carr, S. Doddi, G. Konjevod and M. Marathe. On the red-blue set cover problem. SODA 2000.
5. J. Eckstein, P. Hammer, Y. Liu, M. Nediak, and B. Simeone. The maximum box problem and its application to data analysis. Computational Optimization and Applications, 23(3): 285-298, 2002.
6. U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4): 634-652, 1998.
7. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York, 1979.
8. D.S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. Approximation algorithms for NP-hard problems. PWS, New York, 1997.
9. V.S.A. Kumar and H. Ramesh. Covering rectilinear polygons with axis-parallel rectangles. SIAM J. Comput. 32(6): 1509-1541, 2003.
10. D.S. Johnson. Approximation algorithms for combinatorial problems. J.Comput.System Sci., 9: 256-278, 1974.
11. L.V.S. Lakshmanan, R.T. Ng, C.X. Wang, X. Zhou and T.J. Johnson. The generalized MDL approach for summarization. VLDB 1999.
12. Y. Liu and M. Nediak. Planar case of the maximum box and related problems. CCCG 2003.
13. W.J. Masek. Some NP-Complete Set Covering Problems, manuscript, MIT, Cambridge, MA, 1979.
14. M. Segal. Planar maximum box problem. Journal of Mathematical Modelling and Algorithms, 3: 31-38, 2004.
15. L.G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, 1984.
Questions?