
Page 1:

Linear and Non Linear Dimensionality Reduction for Distributed Knowledge Discovery

Panagis Magdalinos

Supervising Committee: Michalis Vazirgiannis, Emmanuel Yannakoudakis, Yannis Kotidis

Athens University of Economics and Business
Athens, 31st of May 2010

Page 2:

Outline
- Introduction and Motivation
- Contributions
- FEDRA: A Fast and Efficient Dimensionality Reduction Algorithm
  - A new dimensionality reduction algorithm
  - Large scale data mining with FEDRA
- A Framework for Linear Distributed Dimensionality Reduction
- Distributed Non Linear Dimensionality Reduction
  - Distributed Isomap (D-Isomap)
  - Distributed Knowledge Discovery with the use of D-Isomap
- An Extensible Suite for Dimensionality Reduction
- Conclusions and Future Research Directions

Page 3:

Motivation
Top 10 Challenges in Data Mining¹:
- Scaling Up for High Dimensional Data and High Speed Data Streams
- Distributed Data Mining

Typical examples: banks all around the world, the World Wide Web, network management.

More challenges are envisaged in the future, driven by novel distributed applications and trends: peer-to-peer networks, sensor networks, ad-hoc mobile networks, autonomic networking.

Commonality: high dimensional data in massive volumes.

1. Q. Yang and X. Wu: "10 Challenging Problems in Data Mining Research", International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, pp. 597-604.

Page 4:

The curses of dimensionality
- Curse of dimensionality: data mining becomes resource intensive; k-means and k-NN are typical examples.
- Empty space phenomenon: the maximum and minimum distances within a dataset tend to become equal as the number of dimensions grows (i.e., Dmax − Dmin ≈ 0). A small numerical sketch follows below.

[Figure: each added dimension doubles the number of cells in a partitioned space: R1 = 2^1, R2 = 2^2, R3 = 2^3, R4 = 2^4]
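A minimal sketch (not from the talk) illustrating the Dmax − Dmin ≈ 0 claim on uniform random data: the relative contrast between the farthest and nearest neighbor shrinks as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (2, 10, 100, 1000):
    X = rng.random((500, n))                  # 500 uniform random points in R^n
    d = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point
    print(n, (d.max() - d.min()) / d.min())   # relative contrast shrinks with n
```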

Page 5:

Solutions
Dimensionality reduction: MDS, PCA, SVD, FastMap, Random Projections, ... Lower dimensional embeddings that also enable the subsequent addition of new points.

- The curse of dimensionality: significant reduction in the number of dimensions; we can project from 500 dimensions to 10 while retaining cluster structure.
- The empty space phenomenon: distance functions yield meaningful results again; k-NN classification quality almost doubles when projecting from more than 20000 dimensions to 30.
- Computational requirements: distance based algorithms are significantly accelerated; k-Means converges in less than 40 seconds where it initially required almost 7 minutes.

Page 6:

Classification of Problems
- Hard problems: significant reduction required
- Soft problems: milder requirements
- Visualization problems

Classification of Methods
- Linear and Non Linear
- Exact and Approximate
- Global and Local
- Data Aware and Data Oblivious

Page 7:

Quality Assessment
Distortion:
- Provides an upper and a lower bound on each new pairwise distance as a function of the initial distance:
  (1/c1)·D(a,b) ≤ D'(a,b) ≤ c2·D(a,b), with c1, c2 > 1
- A good method minimizes c1·c2.

Stress:
- Distortion alone might be misleading; stress quantifies the distance distortion on a particular dataset (a computational sketch follows below):
  Stress = sqrt( Σ_{i<j} (d(X_i,X_j) − d(X'_i,X'_j))² / Σ_{i<j} d(X_i,X_j)² )

Task related metrics: clustering/classification quality, pruning power, computational cost, visualization.
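A minimal sketch (assumed, not the thesis code) of the stress metric above, computed over all pairs of a dataset X and its embedding Xp:

```python
import numpy as np
from scipy.spatial.distance import pdist

def stress(X, Xp):
    d, dp = pdist(X), pdist(Xp)  # pairwise distances in R^n and in R^k
    return np.sqrt(np.sum((d - dp) ** 2) / np.sum(d ** 2))
```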

Page 8:

Contributions
- Definition of a new, global, linear, approximate dimensionality reduction algorithm: the Fast and Efficient Dimensionality Reduction Algorithm (FEDRA), combining low time and space requirements with high quality results.
- Definition of a framework for the decentralization of any landmark based dimensionality reduction method, motivated by the low memory requirements of landmark based algorithms and applicable in various network topologies.
- Definition of the first distributed, non linear, global, approximate dimensionality reduction algorithm: a decentralized version of Isomap (D-Isomap), with application on knowledge discovery from text collections.
- A prototype enabling experimentation with dimensionality reduction methods (x-SDR), ideal for teaching and research in academia.

Page 9:

FEDRA: A Fast and Efficient Dimensionality Reduction Algorithm

Based on:
- P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "FEDRA: A Fast and Efficient Dimensionality Reduction Algorithm", In Proceedings of the SIAM International Conference on Data Mining (SDM'09), Sparks, Nevada, USA, May 2009.
- P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "Enhancing Clustering Quality through Landmark Based Dimensionality Reduction", Accepted with revisions in the Transactions on Knowledge Discovery from Data, Special Issue on Large Scale Data Mining: Theory and Applications.

Page 10:

The general idea
Instead of trying to map the whole dataset to the new space:
- Extract a small fraction of the data and embed it in the new space, creating the "kernel" around which the whole dataset is going to be placed; minimize the loss of information during this first part of the process.
- Project each remaining point independently, taking into account only the initial set of sampled data.

The formulation of this idea into a coherent algorithm resulted in the definition of FEDRA (Fast and Efficient Dimensionality Reduction Algorithm), a global, linear, approximate, landmark based method.

[Figure: points P1-P4 projected from the original space (X, Y, Z) to the target space (X, Y)]

Page 11:

Our goal
Formulate a method which combines:
- Results of high quality
- Minimum space requirements
- Minimum time requirements
- Scalability in terms of cardinality and dimensionality

Application: hard dimensionality reduction problems
- Projecting from 500 dimensions to 10 while retaining inter-object relations
- Enabling faster convergence of k-Means

Top 10 Challenge addressed: Scaling Up for High Dimensional Data

Page 12:

The FEDRA Algorithm
Input: projection dimensionality (k), original distances in R^n (D), distance metric (p)
Output: new dataset in R^k (P')

1. L ← select k points to populate the set of landmarks
2. L' ← project all landmarks to the target space by requiring that ||L'_i − L'_j||_p = ||L_i − L_j||_p for 1 ≤ i, j ≤ k
3. P' ← L'
4. For each non-landmark point X:
   4.1 X' ← obtain the projection of X by requiring that ||L'_i − X'||_p = ||L_i − X||_p for 1 ≤ i ≤ k
   4.2 P' ← P' ∪ {X'}
5. Return P'

Open questions:
- How do we select landmarks?
- Does this system of equations have a solution?
- Does the algorithm converge? Isn't it time consuming?
- Does this simplification come at a cost?

These are the questions answered in the next few slides.

Page 13:

The theory underlying FEDRA
Theorem 1: A set of k+1 points p_i, i = 1..k+1, described only by their pairwise distances, defined with the use of a Minkowski distance metric p, can be embedded in R^k without distortion. Their coordinates can be derived in polynomial time through the following set of equations:
- if j < i−1, p'_{i,j} is given by the single root of
  |p'_{i,j}|^p − |p'_{i,j} − p'_{j+1,j}|^p + Σ_{f=1}^{j−1} |p'_{i,f}|^p − Σ_{f=1}^{j−1} |p'_{i,f} − p'_{j,f}|^p + d_p(p_{j+1}, p_i)^p − d_p(p_i, p_1)^p = 0
- if j = i−1, p'_{i,j} = ( d_p(p_i, p_1)^p − Σ_{f=1}^{i−2} |p'_{i,f}|^p )^{1/p}
- otherwise, p'_{i,j} = 0

Theorem 2: Any equation of the form f(x) = |x|^p − |x − a|^p − d, where a ∈ R\{0}, d ∈ R, p ∈ N\{0}, has a single root in R:
- if −1 < v = d/|a|^p < 1, the root lies in (0, a)
- otherwise, the root lies in (a, |v|·a)

The cost of embedding the k landmarks is ck²/2, where c is the cost of the Newton-Raphson method (for p = 2, c = 1). A root-finding sketch follows below.
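A hedged sketch of Newton-Raphson root finding for f(x) = |x|^p − |x − a|^p − d from Theorem 2; the parameter names follow the slide, while the starting point and tolerance are assumptions.

```python
def fedra_root(a, d, p, tol=1e-10, max_iter=100):
    # f(x) = |x|^p - |x - a|^p - d has a single root (Theorem 2)
    f = lambda x: abs(x) ** p - abs(x - a) ** p - d
    # derivative: d/dx |x|^p = p * |x|^(p-1) * sign(x)
    df = lambda x: (p * abs(x) ** (p - 1) * (1 if x >= 0 else -1)
                    - p * abs(x - a) ** (p - 1) * (1 if x >= a else -1))
    x = a / 2.0                       # assumed start inside (0, a)
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# for p = 2 the equation is linear, so a single iteration suffices (c = 1)
```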

Page 14:

Theorem 1 in practice (1/2)
No distortion requires that ||P'_i − P'_j||_p = ||P_i − P_j||_p for i, j = 1..4.
- The first point is mapped to P'_1 = O = (0, 0, 0).
- The second point is mapped to P'_2 = (||P_2 − P_1||_p, 0, 0).
- The third point should simultaneously satisfy
  ||P'_3 − P'_1||_p = ||P_3 − P_1||_p and ||P'_3 − P'_2||_p = ||P_3 − P_2||_p;
  the solution is the intersection of the two circles.

[Figure: successive placement of P1, P2 and P3 in the target space, with P3 at the intersection of two circles]

Page 15:

Theorem 1 in practice (2/2)
The fourth point should simultaneously satisfy
||P'_4 − P'_1||_p = ||P_4 − P_1||_p, ||P'_4 − P'_2||_p = ||P_4 − P_2||_p and ||P'_4 − P'_3||_p = ||P_4 − P_3||_p,
i.e., three intersecting spheres. The intersection of two spheres is a circle; consequently, we search for the intersection of a circle with a sphere.

[Figure: P4 placed at the intersection of a circle with a sphere]

Page 16:

Reducing Time Complexity (1/2)
The embedding is simplified through the following iterative scheme. The embedding of X_i in R^k, given the embeddings of P_j, j = 1..i−1, must satisfy:

|x'_{i,1}|^p + |x'_{i,2}|^p + |x'_{i,3}|^p + ... + |x'_{i,i−1}|^p = ||P_1 − X_i||^p
|x'_{i,1} − p'_{2,1}|^p + |x'_{i,2}|^p + |x'_{i,3}|^p + ... + |x'_{i,i−1}|^p = ||P_2 − X_i||^p
|x'_{i,1} − p'_{3,1}|^p + |x'_{i,2} − p'_{3,2}|^p + |x'_{i,3}|^p + ... + |x'_{i,i−1}|^p = ||P_3 − X_i||^p
...
|x'_{i,1} − p'_{i−1,1}|^p + |x'_{i,2} − p'_{i−1,2}|^p + |x'_{i,3} − p'_{i−1,3}|^p + ... + |x'_{i,i−1}|^p = ||P_{i−1} − X_i||^p

Note that by subtracting the second equation from the first we derive
|x'_{i,1}|^p − |x'_{i,1} − p'_{2,1}|^p − ||P_1 − X_i||^p + ||P_2 − X_i||^p = 0,
an equation with a single unknown and a single root, x'_{i,1}. In general, the value of the i-th coordinate is derived by subtracting the (i+1)-th equation from the first. (A projection sketch for p = 2 follows below.)
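A hedged Python sketch (assumption: p = 2, where each subtraction yields a linear equation) of this per-point projection. Lp holds the already embedded landmarks as a lower triangular k×k matrix with Lp[0] = 0, and dX[j] is the original distance ||L_j − X||; both names are illustrative.

```python
import numpy as np

def project_point(Lp, dX):
    k = Lp.shape[0]
    x = np.zeros(k)
    # coordinate j (0-indexed) comes from subtracting equation j+2 from the first
    for j in range(k - 1):
        rhs = dX[0] ** 2 - dX[j + 1] ** 2 + np.dot(Lp[j + 1], Lp[j + 1])
        x[j] = (rhs - 2 * np.dot(x[:j], Lp[j + 1, :j])) / (2 * Lp[j + 1, j])
    # the last coordinate intersects the resulting line with the first sphere
    rem = dX[0] ** 2 - np.dot(x[:k - 1], x[:k - 1])
    x[k - 1] = np.sqrt(max(rem, 0.0))  # rem < 0 only if the triangle
    return x                           # inequality failed in the original space
```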

Page 17:

Reducing Time Complexity (2/2)
By subtracting the i-th equation from the first we essentially compute the corresponding coordinate (i.e., a plane in R³). The intersection of the k−1 planes corresponds to a line, and the first equation is then satisfied by the points P1, P2 where this line intersects the norm-sphere of R³.

We thus lower the time complexity from O(ck²) to O(ck), or even O(k) when p = 2. But what if the intersection of the line with the sphere does not exist?

[Figure: the planes X = a and Y = b intersect in a line, which meets the sphere at points P1 and P2]

Page 18:

Existence of solution
Theorem 3: For any non-linear system of equations defined by FEDRA, there always exists at least one solution, provided that the triangle inequality holds in the original space.

Sketch: convergence could fail only if ||O'A'|| + ||A'L'_1|| < ||O'L'_1||. However, Theorem 1 guarantees that ||O'A'|| = ||OA||, ||A'L'_1|| = ||AL_1|| and ||O'L'_1|| = ||OL_1||, so a failure would imply that the triangle inequality is not sustained in the original space.

Page 19:

The FEDRA Algorithm (revisited)
Input: projection dimensionality (k), original distances in R^n (D), distance metric (p)
Output: new dataset in R^k (P')

1. L ← select k points to populate the set of landmarks
2. L' ← project all landmarks to the target space by applying Theorem 1 and its accompanying methodology
3. P' ← L'
4. For each non-landmark point X:
   4.1 X' ← obtain the projection of X by applying Theorem 1 and its accompanying methodology
   4.2 P' ← P' ∪ {X'}
5. Return P'

- Does this system of equations have a solution? Yes, always!
- Does the algorithm converge? Yes, always!
- Isn't it time consuming? No! In fact it is only O(k) per point!
- How do we select landmarks? Does this simplification come at a cost? Some questions still remain...

Page 20:

FEDRA requirements
FEDRA exhibits low memory requirements combined with low computational complexity:
- Memory: O(k²), k: lower dimensionality
- Time: O(cdk), d: number of objects, c: constant
- Addition of a new point: O(ck)

This is achieved by relaxing the original requirements and requesting that every projected point retain unaltered only k distances to other data points.

Advantageous features:
- Operates on a similarity/dissimilarity matrix
- Applicable with any Minkowski distance metric: FEDRA can provide a mapping from L^n_p to L^k_p for any p ≥ 1

Page 21:

Distortion
Theorem 4: Using any two landmarks L1, L2, FEDRA can project any two points A, B while guaranteeing that their new distance A'B' is bounded according to:
AB² − 4·AA_y·BB_y ≤ A'B'² ≤ AB² + 4·AA_y·BB_y
Alternatively: A'B'² = AB² − 2·BL1·AL1·(cos(∠A'L'1B') − cos(∠AL1B))
Distortion = sqrt( (AB² + 4·AA_y·BB_y) / (AB² − 4·AA_y·BB_y) )
For any Minkowski distance metric p:
AB^p − Δ ≤ A'B'^p ≤ AB^p + Δ, where Δ = 2·BB_y · Σ_{k=1}^{p} (AA_y + BB_y)^{p−k} · (AA_y − BB_y)^{k−1}

Does this simplification come at a cost? The distance distortion is low and upper bounded.

[Figure: points A, B, their feet A_y, B_y relative to landmarks L1, L2, and their projections A', B' relative to L'1, L'2]

Page 22:

Landmark selection
Based on the former analysis, it can be proved that an ideal landmark set should satisfy, for any two landmarks L_i, L_j and any point A, one of the following relations:
- L_iA ≈ L_jA − L_iL_j (or simply L_iL_j ≈ 0), which requires the creation of a compact "kernel" where landmarks exhibit minimum distances from each other
- L_jA ≈ L_iL_j − L_iA, which requires that cluster centroids are chosen as the landmarks

So, if random selection is not acceptable, we use a set of k landmarks that exhibit minimum distance from each other.

How do we select landmarks? Either randomly or heuristically, according to the theory.

[Figure: the two landmark configurations of L1, L2 relative to a point A]

Page 23:

Ameliorating projection quality (I)
Depending on the properties of the selected landmark set, a single case of failure may arise¹.

[Figure: with landmarks L1, L2 on a common axis, clusters A and B that are separate in the original space may collapse into one ("clusters A and B") in the target space]

1. V. Athitsos, J. Alon, S. Sclaroff, G. Kollios, "BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval", IEEE Transactions on PAMI, Vol. 30, No. 1, January 2008.

Page 24:

Ameliorating projection quality (II)
What if we sample an additional set of points and use it to enhance projection quality? We then obtain zero distortion from the landmark points and minimum distortion from another k points.

[Figure: with the additional sample, clusters A and B remain separated in the target space]

Does this simplification come at a cost? The distance distortion is low and upper bounded. Moreover, the projection of a point can be determined using the already projected non-landmark points.

Page 25:

FEDRA Applications
The purpose of the conducted experimental evaluation is to:
- Highlight the efficiency and effectiveness of FEDRA on hard dimensionality reduction problems
- Highlight FEDRA's scaling ability and applicability in large scale data mining
- Showcase the enhancement of a typical data mining task, such as clustering, through the application of FEDRA

Dataset            Cardinality  n    Classes  k         Description
Ionosphere         351          34   2        3:1:7     Radar observations
Segmentation       2100         19   7        3:1:7     Image segmentation data
Musk               476          166  2        3:3:15    Molecules data
Synthetic Control  600          60   6        3:1:7     Synthetic dataset
Alpha              500000       500  2        10:10:50  Pascal Large Scale Challenge '08
Beta               500000       500  2        10:10:50  Pascal Large Scale Challenge '08
Gamma              500000       500  2        10:10:50  Pascal Large Scale Challenge '08
Delta              500000       500  2        10:10:50  Pascal Large Scale Challenge '08

Page 26:

Metrics
We assess the quality of FEDRA through the following metrics (a purity sketch follows below):
- Stress = sqrt( Σ (d(X_i,X_j) − d(X'_i,X'_j))² / Σ d(X_i,X_j)² )
- Clustering quality maintenance, defined as quality in R^k / quality in R^n, with clustering quality measured by Purity = (1/N) Σ_j max_i |C_i ∩ S_j|
- Time required by each algorithm to produce the embedding
- Time required by k-Means to converge

We compare FEDRA with landmark based methods (Landmark MDS, Metric Map, Vantage Objects) as well as prominent methods such as PCA, FastMap and Random Projection.
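A hedged sketch of the purity metric above, assuming integer label arrays: C holds the ground-truth classes and S the produced clusters, both of length N.

```python
import numpy as np

def purity(true_labels, cluster_labels):
    tl = np.asarray(true_labels)
    cl = np.asarray(cluster_labels)
    total = 0
    for s in np.unique(cl):
        members = tl[cl == s]                 # true classes inside cluster s
        total += np.bincount(members).max()   # size of the majority class
    return total / len(tl)
```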

Page 27:

Stress evolution
[Plots: stress on the segmentation and ionosphere datasets]

Page 28:

Purity evolution
[Plots: purity on the alpha and beta datasets]

Experimental analysis indicates: FEDRA exhibits behavior similar to landmark based approaches and slightly ameliorates clustering quality.

Page 29:

Time Requirements
[Plots: embedding time on the alpha and beta datasets]

Page 30:

k-Means Convergence
[Plots: k-Means convergence time on the alpha and beta datasets; annotated baselines: 324 secs and 296 secs]

Experimental analysis indicates: k-Means converges more slowly on the dataset produced by Vantage Objects; FEDRA reduces k-Means' convergence requirements.

Page 31:

Summary
FEDRA is a viable solution for hard dimensionality reduction problems:
- Quality of results comparable to PCA
- Low time requirements, outperformed only by Random Projection
- Low stress values, sometimes lower than FastMap
- Maintains or ameliorates the original clustering quality, with behavior similar to other methods
- Enables faster convergence of k-Means

Page 32:

Linear Distributed Dimensionality Reduction

Based on:
- P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "K-Landmarks: Distributed Dimensionality Reduction for Clustering Quality Maintenance", In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September 2006. (Acceptance rate (full papers): 8.8%)
- P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "Enhancing Clustering Quality through Landmark Based Dimensionality Reduction", Accepted with revisions in the Transactions on Knowledge Discovery from Data, Special Issue on Large Scale Data Mining: Theory and Applications.

Page 33:

The general idea
All landmark based algorithms are applicable in distributed environments. The idea is to sample landmarks from all nodes and use them to define the global landmark set, then communicate this set to all nodes.

[Figure: peers 1-7 contribute local samples that form the global landmark set, which is then communicated back to all peers]

Page 34:

Our goal
Formulate a method which combines:
- Minimum requirements in terms of network resources
- Immunity to subsequent alterations of the dataset
- Adaptability to network changes

Application: hard dimensionality reduction problems
- Projecting from 500 dimensions to 10 while retaining inter-object relations
- Reduction of network resource consumption

State of the art: Distributed PCA, Distributed FastMap

Top 10 Challenge addressed: Distributed Data Mining

Page 35:

Requirements and Candidates
Requirements:
- Some kind of network organization scheme exists (physical topology, self-organization)
- Each algorithm is composed of two parts: a centrally executed part and a decentralized part

Ideal candidate: any landmark based dimensionality reduction algorithm
- Landmark selection process
- Aggregation of landmarks in a central location
- Derivation of the projection operator
- Communication of the operator to all nodes
- Projection of each point independently

Page 36:

Distributed FEDRA
Applying the landmark based paradigm in a network environment:
- Select landmarks at peer level
- Communicate all landmarks to the aggregator: O(nk) network load
- Project the landmarks and communicate the results: O(nkM + Mk²) network load
- Each peer projects each of its points independently

Assuming a fixed number of |L| landmarks, the network requirements of each algorithm are upper bounded by O(n|L|M + M|L|k).

Landmark based algorithms are less demanding than distributed PCA (O(Mn² + nkM)) as long as |L| < n.

Page 37:

Selecting the landmark points
Each peer may select:
- k points from the local dataset: select k local points (randomly or heuristically) and transmit them to the aggregator, which receives Mk points from all peers and selects the landmark set. Network load: O(Mkn + Mk²).
- k/M points from the local dataset: this implies that the aggregator informs the peers about the size of the network. The landmark selection happens only once in the lifetime of the network, so arrivals and departures have no effect. Network load: O(kn + Mk²).
- Zero points from the local set: the aggregator selects k landmarks from its own local dataset. Network load: O(Mk²).

Page 38:

Application
- Datasets from the Pascal Large Scale Challenge 2008
- 500-node network with random connections between elements; nodes are connected with 5% probability
- Distributed k-Means (P2P-Kmeans¹) is employed in order to assess the quality of the produced embedding

Dataset  Cardinality  n    Classes  k         Description
Alpha    500000       500  2        10:10:50  Pascal Large Scale Challenge '08
Beta     500000       500  2        10:10:50  Pascal Large Scale Challenge '08
Gamma    500000       500  2        10:10:50  Pascal Large Scale Challenge '08
Delta    500000       500  2        10:10:50  Pascal Large Scale Challenge '08

1. S. Datta, C. Giannella, H. Kargupta: Approximate Distributed K-means Clustering over a P2P Network. IEEE TKDE, Vol. 21, No. 10, October 2009.

Page 39:

[Plots: results on the alpha, beta, gamma and delta datasets]

Page 40:

Network Requirements
- Random Projection deviates from the framework: the aggregator only identifies the projection matrix
- Distributed clustering induces a network cost of more than 10GB
- Hard dimensionality reduction preprocessing, requiring at most 200MB, reduces the cost to roughly 1GB

Page 41:

Summary
Landmark based dimensionality reduction algorithms provide a viable solution for distributed dimensionality reduction pre-processing:
- High quality results
- Low network requirements
- No special requirements in terms of network organization
- Adaptability to potential failures

Results obtained in a network of 500 peers:
- Dimensionality reduction preprocessing and subsequent P2P-Kmeans application necessitate only 12% of the original P2P-Kmeans load
- Clustering quality remains the same or is slightly ameliorated

Distributed FEDRA: low network requirements combined with high quality results.

Page 42: Linear and Non Linear Dimensionality Reduction for Distributed Knowledge Discovery Panagis Magdalinos Supervising Committee: Michalis Vazirgiannis, Emmanuel

Distributed Non Linear Dimensionality Reduction

Based on :•P.Magdalinos, M.Vazirgiannis, D.Valsamou, "Distributed Knowledge Discovery with Non Linear Dimensionality Reduction", To appear in the Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'10), Hyderabad, India, June 2010. (Acceptance Rate (full paper) 10,2%) •P. Magdalinos, G.Tsatsaronis, M.Vazirgiannis, “Distributed Text Mining based on Non LinearDimensionality Reduction", Submitted to European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2010), Currently under review.

Athens University of Economics and BusinessAthens, 31st of May 2010

Page 43:

Our goal
Top 10 Challenges addressed: distributed data mining of high dimensional data
- Scaling Up for High Dimensional Data
- Distributed Data Mining

Vector Space Model: each word defines an axis, so each document is a vector residing in a high dimensional space. Numerous methods try to project data to a low dimensional space while assuming linear dependence between variables; however, recent experimental results show that this assumption is incorrect.

Application: hard dimensionality reduction and visualization problems
- Unfolding a manifold distributed across a network of peers
- Mining information from distributed text collections

State of the art: none!

Page 44:

The general idea
Replicate the original Isomap algorithm in a highly distributed environment and still obtain results of equal quality.

Distributed Isomap (D-Isomap) is a three-phase approach: distributed nearest-neighbor (NN) retrieval, distributed shortest-path (SP) computation, and multidimensional scaling at peer level.

[Figure: Isomap on a single site vs. D-Isomap, where peers 1-8 run distributed NN and SP algorithms and each peer i performs multidimensional scaling]

Page 45:

Indexing and k-NN retrieval (1/4)
Which LSH family to employ¹? Since we use the Euclidean distance, we should use a Euclidean distance preserving mapping:
h_{r,b}(x) = floor((x·r + b)/w), where x is the data point, r is a 1×n random vector, w ∈ N and b ∈ [0, w).
This family of functions guarantees that the probability of collision is analogous to the points' original distance. Given f hash functions, each table yields an f-dimensional hash vector. (A sketch of this family follows below.)

1. A. Andoni, P. Indyk: Near-optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Communications of the ACM, 51(1), 2008.

[Figure: an f-dimensional hash vector produced by hash functions hash1..hashf]
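A minimal sketch (assumed, with illustrative names) of the Euclidean LSH family above: f functions h(x) = floor((r·x + b)/w) stacked into one hash vector.

```python
import numpy as np

def make_lsh(n, f, w, rng):
    R = rng.normal(size=(f, n))        # one random vector r per hash function
    b = rng.uniform(0, w, size=f)      # offsets b in [0, w)
    return lambda x: np.floor((R @ x + b) / w).astype(int)

rng = np.random.default_rng(0)
h = make_lsh(n=100, f=8, w=4.0, rng=rng)
key = h(rng.normal(size=100))          # the f-dimensional hash vector
```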

Page 46:

Indexing and k-NN retrieval (2/4)
Indexing while guaranteeing load balancing:
- Consider the l1 norm of the produced vector, Σ_{i=1}^{f} |h_i(x)|
- These values are generated from the normal distribution N(f/2, fμ||x||/w)¹
- Consider 2 standard deviations around the mean and split that range into M cells
- For a given hash vector v, the peer that will index it is (see the sketch below):
  peer_id = ( M·(||v||_1 − μ_l1 + 2σ_l1) / (4σ_l1) ) mod M

1. P. Haghani, S. Michel, K. Aberer: Distributed Similarity Search in High Dimensions using Locality Sensitive Hashing. ACM EDBT, pp. 744-755, 2009.

[Figure: the range μ_l1 ± 2σ_l1 of l1 norms is split into M cells, each assigned to a peer]
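A hedged sketch of the peer-assignment rule above; truncating to an integer cell index before the mod is an assumption about how the formula is applied.

```python
import numpy as np

def peer_for(v, mu_l1, sigma_l1, M):
    l1 = np.abs(v).sum()                            # l1 norm of the hash vector
    cell = int(M * (l1 - mu_l1 + 2 * sigma_l1) / (4 * sigma_l1))
    return cell % M                                 # wrap norms outside 2 sigma
```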

Page 47:

Indexing and k-NN retrieval (3/4)
How do we effectively and efficiently search for the kNNs of each point?
Baseline: for each local point d_i and for each table T, find the peer that indexes it and retrieve all points from the corresponding bucket; then retrieve the actual points, calculate the actual distances, rank them and retain the k NNs.

What if we could identify a range and upper bound the difference δ = | ||h(x)||_1 − ||h(y)||_1 |?

Theorem 5: Given f hash functions h_i(x) = floor((r_i·x^T + b_i)/w), where r_i is a 1×n random vector, w ∈ N, b_i ∈ [0, w), i = 1..f, the difference δ of the l1 norms of the projections x_f, y_f of two points x, y ∈ R^n is upper bounded by (||A||·||x − y||)/w, where A = Σ_{i=1}^{f} |r_i| and ||x − y|| is the points' Euclidean distance.

Although the bound is rather large, it still reduces the required number of messages.

Page 48:

Indexing and k-NN retrieval (4/4)
[Figure: indexing and k-NN retrieval message flows: a peer hashes vector V, routes (||hash(V)||_1, peer id) to the responsible indexing peer, and k-NN queries are answered through the indexing peer with request-reply exchanges to the host peer]

- Indexing: messages O(dT), time O(d_i·T·f·n), memory O(fn)
- k-NN retrieval: messages O(cs·k·d), time O(cs·k·d_i), memory O(cs·k·n)

Page 49:

Geodesic Distances (1/2)
At this step, each peer has identified the NN graphs of its points, G = ∪_{i=1}^{|D_i|} G_i. The target is to identify the SPs from each point to the rest of the dataset.

We use best practices from computer networking: Distance Vector Routing (DVR), i.e., distributed Bellman-Ford. Assume that each point is a network node and each calculated distance a link between the corresponding points/nodes. From a node's perspective, DVR replicates a ranged search, starting with one link and progressively augmenting the range by 1. (A simulation sketch follows below.)

[Figure: each node progressively fills its rows of the global distance matrix over V1..Vn]
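A minimal sketch (assumption: a single process simulating all peers) of the distance-vector computation of geodesic distances over the kNN graph; each round extends shortest paths by one hop, in the spirit of DVR/Bellman-Ford.

```python
def geodesic_distances(nodes, edges):
    # nodes: list of point ids; edges: list of undirected (u, v, w) kNN links
    INF = float("inf")
    dist = {u: {v: (0.0 if u == v else INF) for v in nodes} for u in nodes}
    changed = True
    while changed:                        # one round per additional hop
        changed = False
        for u, v, w in edges:
            for t in nodes:               # relax every known path over the link
                for a, b in ((u, v), (v, u)):
                    if dist[b][t] + w < dist[a][t]:
                        dist[a][t] = dist[b][t] + w
                        changed = True
    return dist
```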

Page 50:

Geodesic Distances (2/2)
[Figure: starting at node 1, paths are discovered 1, 2 and 3 hops away across peers 1-5]

Peer 5 will never be reached: the graph is not connected. Setting the missing distance to ∞ does not help; substituting it with 5·max(distance) makes the graph connected.

Costs: messages O(kNN·M·d²), space O(d_i·d), time O(M).

Page 51:

Multidimensional Scaling
At this step, each peer has a fraction of the global geodesic distance matrix. Instead of calculating the MDS, approximate it: employ landmark based dimensionality reduction algorithms to derive the embedding and approximate the whole dataset at peer level, all with zero network load. What if the landmarks are not enough? Employ the approach of distributed FEDRA. Network requirements: O(knM).

[Figure: each peer holds some rows of the global distance matrix and locally produces the embedding of the full dataset]

Page 52:

Reducing Messages
Since we work only with a small number of landmarks, why not calculate only their shortest paths?
- A node is randomly selected and initiates the SP process
- It selects the required number of landmarks (i.e., a)
- It initiates the SP algorithm: O(a·d·kNN·M) messages
- It communicates the results to all nodes: O(Ma) messages
- All nodes execute the landmark based DR algorithm locally

Network cost: base approach O(kNN·M·d²); landmark based approach O(a·d·kNN·M + Md). The landmark based approach is always cheaper.

D-Isomap in total: messages O(cs·d·k + dT + a·d·kNN·M), time O(cs·k·d_i + M) + C_DLDR, space O(cs·k·d_i + d_i·d) + C_DLDR.

Page 53:

Adding or Deleting points
Addition of points:
- Hashing and identification of the kNNs
- Calculation of geodesic distances from the landmarks using local information
- Low dimensional projection using FEDRA, LMDS or Vantage Objects
- Network cost O(cs·kNN), time O(cs·kNN) + C_DLDR, memory O(n + kNN) + C_DLDR

Deletion of points: inform the indexing peer that the point is deleted.

Example: X4 arrives and its nearest neighbors are X1 and X2, at distances y and z. The local landmark distance matrix is extended using only local information (a sketch follows below):

     X1  X2  X3  X4
L1   a   b   v   min{y+a, z+b}
L2   k   h   r   min{y+k, z+h}
L3   u   i   o   min{y+u, z+i}
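A hedged sketch of the point-addition rule above: the new point's geodesic distance to each landmark is the best path through any of its kNN neighbors. The dictionary shapes are illustrative.

```python
def add_point_distances(landmark_dists, neighbor_edges):
    # landmark_dists: {landmark: {point: geodesic distance to that point}}
    # neighbor_edges: {neighbor point: distance from the new point}
    return {l: min(d[x] + w for x, w in neighbor_edges.items())
            for l, d in landmark_dists.items()}

# Example from the slide: neighbors X1 (at y) and X2 (at z) give
# min{y + a, z + b} for landmark L1.
dists = add_point_distances(
    {"L1": {"X1": 1.0, "X2": 2.0}},   # a = 1.0, b = 2.0
    {"X1": 0.5, "X2": 0.2})           # y = 0.5, z = 0.2
```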

Page 54:

Experimental Evaluation
The purpose of the conducted experimental evaluation is to:
- Validate the non linear nature of D-Isomap on well known manifolds
- Highlight D-Isomap's applicability in distributed knowledge discovery experiments
- Compare D-Isomap's performance against state of the art, centralized methods for unsupervised clustering and classification of document collections

Dataset       Cardinality  n       Classes  k           Peers       Description
Swiss Roll    3000         3       ---      2           10:5:30     Swiss Roll dataset
Helix         3000         3       ---      2           10:5:30     Helix dataset
3D Clusters   3000         3       ---      2           10:5:30     Artificial 5-cluster dataset
Reuters       12216        21454   117      10:5:30     100:25:200  Reuters text collection
20 Newsgroup  18846        130080  20       100:25:200  100:25:200  20 NG text collection

Page 55:

Non linear manifolds (1/3)
[Figure: what we expect to see; the input manifolds and their expected unfolded outputs]

Page 56:

Non linear manifolds (2/3)
[Figure: embeddings produced by D-Isomap with LMDS and with FEDRA (p = 2 and p = 3) on the benchmark manifolds]

Page 57:

Non linear manifolds (3/3): Network Requirements (MBs)
Network composed of 30 peers; actual size of the dataset: 60KB.

kNN  Full SP  Full SP, Bound  Partial SP  Partial SP, Bound
6    14.584   14.454          0.251       0.131
8    14.584   14.454          0.251       0.131
10   14.584   14.454          0.251       0.132
12   14.584   14.454          0.251       0.132
14   14.584   14.454          0.251       0.132

kNN  Full SP  Full SP, Bound  Partial SP  Partial SP, Bound
2    0.29     0.29            0.23        0.23
3    44.70    44.69           0.53        0.53
4    44.53    44.45           0.53        0.53
5    44.16    42.19           0.53        0.53
6    42.84    42.03           0.52        0.53

kNN  Full SP  Full SP, Bound  Partial SP  Partial SP, Bound
6    39.14    39.92           0.50        0.49
8    34.52    34.41           0.47        0.46
10   31.51    31.53           0.45        0.46
12   29.46    29.49           0.45        0.44
14   28.00    28.10           0.45        0.43

Observations:
- Theorem 5 reduces network requirements but is influenced by the range bound.
- Second table: for kNN = 2 the graph is not connected and distance substitution did not work, so D-Isomap failed.
- Third table: the graph is not connected but distance substitution works; larger kNN values reduce network requirements.

Page 58:

Text Mining with D-Isomap
We compare D-Isomap with:
- LSI
- LSK (kernel LSI)
- LPI (a hybrid of kernel LSI and Spectral Clustering)

We assume:
- 100:25:200 peers connected in a Chord-style ring
- kNN = 6:2:14 for LPI and D-Isomap, and cs = 5 for kNN retrieval
- Documents are represented as vectors using term frequency; the norm is not normalized to 1

Algorithms: k-Means, k-NN (NN = 7)

Metrics:
- Quality maintenance, defined as F-measure in R^k / F-measure in R^n, where F-measure = 2·precision·recall / (precision + recall)
- Network load

Page 59:

Obtained results
[Plots: Reuters and 20-Newsgroup, using kNN = 14 for D-Isomap and LPI; classification with k-NN (using 7 NNs) and with k-Means]

Page 60:

Network Requirements
The main disadvantage:
- Network load of 4.5-6.5GB on Reuters (20-60MB per node)
- Network load of 3.8-6GB on 20-Newsgroup (17-60MB per node)
- Incurred once in the lifetime of the network

Network load is minimized as kNN values grow larger, since the graph diameter is reduced.

Page 61:

Summary
Distributed Isomap:
- The first distributed, non linear dimensionality reduction algorithm
- Manages to reveal the underlying linear nature of highly non linear manifolds
- Enhances the classification ability of k-NN
- Manages to approximately reconstruct the original dataset on a single peer node

Results obtained in a network of 200 peers:
- Experimental validation of the curse of dimensionality and the empty space phenomenon (projecting to 0.05% of the initial dimensions almost doubled the produced F-measure)
- D-Isomap produced results of quality comparable, and sometimes superior, to central algorithms
- Disadvantage: high network requirements

Page 62:

x-SDR: An eXtensible Suite for Dimensionality Reduction

Based on:
- P. Magdalinos, A. Kapernekas, A. Mpiratsis, M. Vazirgiannis, "X-SDR: An Extensible Experimentation Suite for Dimensionality Reduction", Submitted to the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2010), currently under review.

Downloadable from: www.db-net.aueb.gr/panagis/X-SDR

Page 63:

The X-SDR Prototype
An open source extensible suite:
- C# and Matlab
- http://www.db-net.aueb.gr/panagis/X-SDR/installation/downloads/xSDR-SC.7z
- Aggregates well known prototypes from data mining (Weka) and dimensionality reduction (the MTDR suite)

Key features:
- Easily extensible by the user
- Does not require any special programming skills
- Evaluation of results through specific metrics, visualization and data mining

Exploitation: will be used in the context of data mining and machine learning courses.

Page 64:

Conclusions and Future Research Directions

Page 65:

Conclusions
Introduced novelties:
- FEDRA, a new, global, linear, approximate dimensionality reduction algorithm combining low time and space requirements with high quality results
- A methodology for the decentralization of any landmark based dimensionality reduction method, applicable in various network topologies
- D-Isomap, the first distributed, non linear, global, approximate dimensionality reduction algorithm, with application on knowledge discovery from text collections
- A prototype enabling experimentation with dimensionality reduction methods (x-SDR)

Page 66:

Future Work
D-Isomap has great potential. Assume a global landmark selection process. Given the low dimensional embedding d' of any document d:
- d' at peer_i equals d' at peer_j, hence hash(d' at peer_i) = hash(d' at peer_j)
- After termination, apply a second hash function and create a new distributed hash table

Every peer is then capable of answering any query; pointers to relevant documents can be retrieved with a single message:
- The queried peer searches locally in the approximated dataset and retrieves a relevant document d_r
- It applies the hash function and retrieves the indexing peers p_ind
- It retrieves from p_ind the actual host peer p_h
- The cost is only a couple of bytes (hash(d_r) and the IP of p_h)

Focus on applying D-Isomap in a real-life scenario!

Page 67:

Publications
Accepted:
- P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "Enhancing Clustering Quality through Landmark Based Dimensionality Reduction", Accepted with revisions in the Transactions on Knowledge Discovery from Data, Special Issue on Large Scale Data Mining: Theory and Applications.
- D. Mavroeidis, P. Magdalinos, "A Sequential Sampling Framework for Spectral k-Means based on Efficient Bootstrap Accuracy Estimations: Application to Distributed Clustering", Accepted with revisions in the Transactions on Knowledge Discovery from Data.
- P. Magdalinos, M. Vazirgiannis, D. Valsamou, "Distributed Knowledge Discovery with Non Linear Dimensionality Reduction", To appear in the Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'10), Hyderabad, India, June 2010. (Acceptance rate (full papers): 10.2%)
- P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "FEDRA: A Fast and Efficient Dimensionality Reduction Algorithm", In Proceedings of the SIAM International Conference on Data Mining (SDM'09), Sparks, Nevada, USA, May 2009.

Page 68:

Publications (cont.)
- P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "K-Landmarks: Distributed Dimensionality Reduction for Clustering Quality Maintenance", In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September 2006. (Acceptance rate (full papers): 8.8%)
- P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "A Novel Effective Distributed Dimensionality Reduction Algorithm", SIAM Feature Selection for Data Mining Workshop (SIAM-FSDM'06), Bethesda, Maryland, April 2006.

Under review:
- P. Magdalinos, G. Tsatsaronis, M. Vazirgiannis, "Distributed Text Mining based on Non Linear Dimensionality Reduction", Submitted to ECML-PKDD 2010, currently under review.
- P. Magdalinos, A. Kapernekas, A. Mpiratsis, M. Vazirgiannis, "X-SDR: An Extensible Experimentation Suite for Dimensionality Reduction", Submitted to ECML-PKDD 2010, currently under review.

Page 69:

Technical Reports
- D. Mavroeidis, P. Magdalinos, M. Vazirgiannis, "Distributed PCA for Network Anomaly Detection based on Sparse PCA and Principal Subspace Stability", AUEB, 2008.

Page 70:

Thank you!

Page 71:

Back Up Slides

Page 72:

Intrinsic Dimensionality with...
The eigenvalues approach:
- The number of principal components which retain variance above a certain threshold (PCA)
- Identify the maximum eigengap, which also identifies the number of data clusters (Spectral Clustering)
- The number of eigenvalues above a certain threshold

The stress approach:
- Project the dataset (or a sample) to various target dimensionalities and plot the derived stress values

Clustering followed by PCA: works well on non linear data.

Correlation dimension (the number of objects closer than r is proportional to r^D); a sketch follows below:
- Compute C(r) = 2/(n(n−1)) Σ_{i=1}^{n} Σ_{j=i+1}^{n} I{||x_i − x_j|| < r}
- Plot log C(r) versus log r
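A minimal sketch (assumed) of the correlation-dimension estimate above: the slope of log C(r) versus log r approximates the intrinsic dimensionality D, provided the radii are large enough that C(r) > 0.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    d = pdist(X)                                   # all pairwise distances
    C = np.array([(d < r).mean() for r in radii])  # fraction of close pairs
    return np.polyfit(np.log(radii), np.log(C), 1)[0]  # slope ~ D
```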

Page 73:

FEDRA requirements (ext.)
Artificially generated dataset: 5000 objects with 1000 dimensions. Experimental assessment of:
- The dependence of FEDRA on the size of the dataset (progressive augmentation of the dataset with a step of 100 objects)
- The dependence of FEDRA on the Minkowski metric (parameter c)

Page 74:

Comparing against SOTA

Algorithm          Time               Space         Addition
MDS                O(d³)              O(d²)         O(d)
PCA/SVD            O(n³ + n²d)        O(nd + n²)    O(kn)
FastMap            O(dk)              O(n²)         O(k)
Random Projection  O(dnε⁻²·log d)¹,²  O(kn)         O(nε⁻²·log d)
Landmark MDS       O(ksd + s³)        O(ks)         O(ns + ks)
Metric Map         O(dk² + k³)        O(k²)         O(k²)
BoostMap           O(dT)              O(d)          O(k)
SparseMap          O(d·log²d)         O(d·log²d)    O(log²d)
Vantage Objects    O(dk)              O(k²)         O(k)
FEDRA              O(cdk)             O(k²)         O(ck)

1. N. Ailon, B. Chazelle: Faster Dimension Reduction. Communications of the ACM, 52(3), pp. 97-104, 2010.
2. Construction of the projection matrix requires O(n log n).

Page 75: Linear and Non Linear Dimensionality Reduction for Distributed Knowledge Discovery Panagis Magdalinos Supervising Committee: Michalis Vazirgiannis, Emmanuel

k-nn querying with FEDRA Consider two landmarks L1, L2

and an embedded object X. Range query (points r away from

X in Rn) Inside circle (L1,d(L1,X)+r)

Inside circle (L2,d(L2,X)+r) The intersection is our solution

All objects which are exactly r from X in the original space lay: Outside circle (L1,d(L1,X)-r)

Inside circle (L1,d(L1,X)+r)

Outside circle (L2,d(L2,X)-r)

Inside circle (L2,d(L2,X)+r) The common place of these

circles holds all points which exhibit distance r from X in Rn

75Athens University of Economics and BusinessAthens, 31st of May 2010

L1

L2

d(L1,X)

d(L2,X)

r

Page 76:

Sphere to Sphere Intersection
Intersecting spheres:
S1: x² + y² + z² = R²
S2: (x − d)² + y² + z² = r²
S2 − S1: (x − d)² + R² − x² = r², so x² − 2dx + d² − x² = r² − R², hence x = (d² − r² + R²)/(2d).
This is where FEDRA's computations halt.
The intersection is the circle y² + z² = R² − x² = (4d²R² − (d² − r² + R²)²)/(4d²).

Page 77:

Random Projections (1/2)
Johnson-Lindenstrauss Lemma [1984]: For any 0 < ε < 1 and any integer d, let k be a positive integer such that k ≥ 4(ε²/2 − ε³/3)⁻¹·ln d. Then for any set V of d points in R^n there is a map f: R^n → R^k such that for all u, v ∈ V, (1−ε)·||u−v||² ≤ ||f(u)−f(v)||² ≤ (1+ε)·||u−v||². Further, this mapping can be found in randomized polynomial time.

[Achlioptas, PODS 2001]: Two distributions (see the sketch below):
- ±1, each with probability 1/2
- √3·(±1) with probability 1/6 each, otherwise zero

[Ailon, STOC 2006]: Cost: O(dkn) in theory; in practice implementation dependent, and even in the most naive implementation it is much less, since the projection matrix is only 1/3 full with ±1.

[Alon, Discrete Math 2003]: the projection matrix cannot become sparser, except by a factor of log(1/ε).
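A hedged sketch of the sparse Achlioptas distribution above: entries are √3·(+1) or √3·(−1) with probability 1/6 each and zero otherwise; the 1/√k scaling is a common normalization assumption.

```python
import numpy as np

def achlioptas_matrix(n, k, rng):
    # sparse variant: 2/3 of the entries are zero
    vals = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)],
                      size=(n, k), p=[1/6, 2/3, 1/6])
    return vals / np.sqrt(k)    # project with X @ R: R^n -> R^k

rng = np.random.default_rng(0)
R = achlioptas_matrix(n=500, k=10, rng=rng)
```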

Page 78: Linear and Non Linear Dimensionality Reduction for Distributed Knowledge Discovery Panagis Magdalinos Supervising Committee: Michalis Vazirgiannis, Emmanuel

Fast Johnson-Lindenstrauss Transform [Ailon, Comm ACM 2010]:
Given a fixed set X of d points in R^n, ε < 1 and p ∈ {1, 2}, draw a matrix F from FJLT. With probability at least 2/3 the following two events occur:
For any x ∈ X: (1 - ε) a_p ||x||_2 ≤ ||Fx||_p ≤ (1 + ε) a_p ||x||_2, where a_1 = k√(2π^{-1}) and a_2 = k
The mapping requires O(n log n + n ε^{-2} log d) operations

The FJLT trick: densification of sparse vectors through a fast Fourier transform

FJLT vs Achlioptas:
The projection matrix is sparser than 2/3 zeros!
Advantage: faster projection
Disadvantage: bounds are guaranteed only for p = 1, 2

FEDRA vs Achlioptas:
Achlioptas' bounds are stricter than FEDRA's
FEDRA provides bounds when projecting from L_p^n to L_p^k, while Achlioptas projects from L_2^n to L_p^k
FEDRA projects close points closer and distant points further apart

FEDRA vs FJLT:
FJLT provides bounds only for projections from L_2^n to L_p^k with p ∈ {1, 2}


Page 79: FEDRA vs Prominent DR methods

FEDRA against PCA, SVD and Random Projection
Metric: Incorrectly Clustered Instances (ICI, essentially 1 - Purity)
Depiction: ICI vs Stress; a small helper for computing ICI follows below
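A minimal sketch of the ICI metric, assuming hard cluster assignments and known class labels; the helper is illustrative, not the evaluation code used in the experiments.

```python
from collections import Counter

def incorrectly_clustered_instances(cluster_ids, class_labels):
    """ICI = 1 - purity: the fraction of points that do not belong to
    the majority class of their cluster."""
    correct = 0
    for c in set(cluster_ids):
        members = [lbl for cid, lbl in zip(cluster_ids, class_labels) if cid == c]
        correct += Counter(members).most_common(1)[0][1]   # majority-class count
    return 1.0 - correct / len(cluster_ids)

# Two clusters, one impure point out of five -> purity 0.8, ICI 0.2
print(incorrectly_clustered_instances([0, 0, 1, 1, 1], ['a', 'a', 'b', 'b', 'a']))
```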


Page 80: Stress evolution - FEDRA

Experimental analysis indicates:
The best setup should include the projection heuristic
Heuristic landmark selection does not produce significantly better results than random FEDRA
Best setup: random landmark selection and assisted projection
(a sketch of the stress measure follows below)
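For reference, a minimal sketch of a stress computation, assuming the common normalised definition stress = sqrt(Σ (δ_ij - d_ij)^2 / Σ δ_ij^2) over all pairwise distances; the thesis may use a slightly different variant.

```python
import numpy as np
from scipy.spatial.distance import pdist

def stress(X_original, X_embedded):
    """Normalised stress between original and embedded pairwise
    distances; 0 means distances are perfectly preserved."""
    d_orig = pdist(X_original)   # condensed pairwise distances, original space
    d_emb = pdist(X_embedded)    # pairwise distances after projection
    return np.sqrt(np.sum((d_orig - d_emb) ** 2) / np.sum(d_orig ** 2))
```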


Page 81: Purity Evolution - FEDRA

Experimental analysis indicates:
All setups maintain clustering quality in the new space (using only 2%-10% of the initial dimensions)
Best setup: random landmark selection and random projection


Page 82: Time Requirements - FEDRA

Experimental analysis indicates:
Random FEDRA is faster than any other configuration
Assisted projection is sometimes cheaper than landmark selection!
Best setup: random landmark selection and random projection


Page 83: k-Means Convergence - FEDRA

Experimental analysis indicates:
All approaches exhibit approximately the same results
Landmark selection and assisted projection significantly enhance k-Means' speed of convergence (only 10 seconds)

So which is the best setup?
Based on results alone: the landmark selection and assisted projection configuration
Results vs cost: random FEDRA seems a viable compromise


Page 84: Purity evolution (ext.)


[Plots: purity evolution on the gamma and delta datasets]

Page 85: F-measure maintenance (1/2)


Evaluation of clustering using the F-measure:
F-measure = 2 · Recall · Precision / (Recall + Precision)
Recall = True Positives / (True Positives + False Negatives)
Precision = True Positives / (True Positives + False Positives)
(a runnable version follows below)
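The definitions above, with the implied parentheses restored, in runnable form; an illustrative helper, not the evaluation code used in the experiments.

```python
def f_measure(tp, fp, fn):
    """F-measure = 2 * Precision * Recall / (Precision + Recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f_measure(tp=80, fp=20, fn=40))   # precision 0.8, recall ~0.667 -> F ~0.727
```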

[Plots: F-measure on the alpha and beta datasets]

Page 86: F-measure maintenance (2/2)


[Plots: F-measure on the gamma and delta datasets]

Page 87: F-measure with P2P k-Means (1/2)


Evaluation of clustering using the F-measure (defined as on page 85)

[Plots: F-measure with P2P k-Means on the alpha and beta datasets]

Page 88: F-measure with P2P k-Means (2/2)


[Plots: F-measure with P2P k-Means on the gamma and delta datasets]

Page 89: D-Isomap Requirements - Assumptions

We want to follow the Isomap paradigm but apply it in a network context. The following requirements arise:

Approximate NN querying in a network context
Consider an LSH-based DHT, and therefore a structured P2P network like Chord

Calculating shortest paths in a distributed environment
Consider the distributed shortest-path algorithms widely used for routing in the Internet

Approximating the multidimensional scaling step
Consider landmark-based dimensionality reduction approaches that operate on small fractions of the whole dataset

Assumption: M peers organized in a Chord-ring topology.


Page 90: p-stable distributions and LSH

Definition:
A distribution D over R is called p-stable if there exists p ≥ 0 such that, for any n real numbers r_1, ..., r_n and i.i.d. variables X_1, ..., X_n with distribution D, the random variable Σ_i r_i X_i has the same distribution as ||r||_p X, where X is a random variable with distribution D.

From p-stable distributions to locality-sensitive hashing:
Notice that r X^T = Σ_i r_i X_i
Therefore, given u_1, u_2 with d_p(u_1, u_2) = ||u_1 - u_2||_p:
u_1 X^T - u_2 X^T = (u_1 - u_2) X^T, which is distributed as d_p(u_1, u_2) X
So if a = u_1 X^T and b = u_2 X^T, a small value of |a - b| implies a small d_p(u_1, u_2)
"Small" compared to what?
Fix an interval of width w and quantize each projected value onto it: h(u_i) = floor((u_i a^T + b) / w), where a is the random projection vector and b a random offset in [0, w)
A collision (i.e. identical hash values) translates to a small |a - b|; a runnable sketch follows below
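A minimal sketch of the resulting hash family, assuming a Gaussian (2-stable) draw for the projection vector; w and the names are illustrative choices.

```python
import numpy as np

class PStableHash:
    """h(u) = floor((u . a + b) / w), with a drawn from a p-stable
    distribution (Gaussian => p = 2) and b uniform in [0, w)."""
    def __init__(self, dim, w=4.0, seed=None):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal(dim)   # 2-stable projection vector
        self.b = rng.uniform(0.0, w)        # random offset
        self.w = w

    def __call__(self, u):
        return int(np.floor((u @ self.a + self.b) / self.w))

# Nearby points hash to the same bucket with high probability:
rng = np.random.default_rng(1)
h = PStableHash(dim=50, w=4.0, seed=1)
u1 = rng.standard_normal(50)
u2 = u1 + 0.01 * rng.standard_normal(50)   # a close neighbour of u1
print(h(u1) == h(u2))                      # True with high probability
```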


Page 91: Solving the non-connected neighbourhood graph (NG) problem of Isomap

Instead of calculating the shortest paths on a possibly disconnected graph, calculate Minimum Spanning Trees:
Finding a k-connected subgraph or a k-edge-connected minimal spanning tree, however, are NP-hard problems

Proposals in the literature:
Combination of k-edge-connected MSTs [D. Zhao, L. Yang, TPAMI 2009], which also proposes a solution for updating the shortest paths
Incremental Isomap [M. Law, A. K. Jain, TPAMI 2006]

Our "trick" for connected graphs (see the sketch below):
Simple, and based on the intuition that if a sub-graph is separated from the rest, then its points probably belong to a different cluster and should therefore be attributed a large distance value
The inverse of the technique employed in [M. Vlachos et al., SIGKDD 2002]
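A minimal sketch of this trick, assuming the kNN graph is given as a weighted adjacency matrix; the choice of the "large value" constant is illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, shortest_path

def geodesic_with_penalty(adjacency, large_value=None):
    """Shortest-path (geodesic) distances on the kNN graph; pairs that
    fall in different connected components get a single large distance
    instead of infinity, pushing separated sub-graphs apart in the
    embedding."""
    dist = shortest_path(adjacency, directed=False)
    n_comp, _ = connected_components(adjacency, directed=False)
    if n_comp > 1:
        if large_value is None:
            finite = dist[np.isfinite(dist)]
            large_value = 2.0 * finite.max()   # illustrative choice of "large"
        dist[~np.isfinite(dist)] = large_value
    return dist
```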


Page 92: Swiss Roll – 30 peers – various kNN


Page 93: Helix – 30 peers – various kNN


Page 94: 3D Clusters – 30 peers – various kNN


Page 95: Original values of k-Means and k-NN during the D-Isomap experiments

k-NN classification results for NN = 7 on Reuters: F-measure ≈ 0.45 (micro F-measure)
k-Means clustering results for Reuters (top 10 categories): F-measure ≈ 0.25
k-NN classification results for NN = 7 on 20 Newsgroups: F-measure ≈ 0.55 (micro F-measure)
k-Means clustering results for 20 Newsgroups: F-measure ≈ 0.22


Page 96: Future Work (ext.)

Extensions will concentrate on the following three axes:

Minimize network requirements
Instead of requesting the actual document, retrieve its projection obtained with Random Projection (fixed ε)
Definition of a formal, dataset-specific method for setting the bound of Theorem 5

Ameliorate the produced results
Apply edge-covering techniques from graph theory in order to select a good set of landmarks for the shortest-path process

Enhance D-Isomap's viability for large-scale retrieval
Create clusters of nodes, all holding the same information (i.e. Crespo & Garcia-Molina's concept of Semantic Overlay Networks, SON)
Adapt techniques from routing (e.g. OSPF) so as to enable neighboring clusters to exchange information
Adapt a name-resolution protocol (e.g. DNS) so as to enable quick and reliable information retrieval from clusters


Page 97: Source Code and Results

For FEDRA and the Framework for Distributed Dimensionality Reduction:
www.db-net.aueb.gr/panagis/TKDD2009

For D-Isomap:
www.db-net.aueb.gr/panagis/PAKDD2010/ (manifold unfolding capability)
www.db-net.aueb.gr/panagis/PKDD2010/ (assessment of the extensions and application on text collections)

For x-SDR:
www.db-net.aueb.gr/panagis/X-SDR (source code, analysis, deployment instructions)
