k-meanscse802/clusteringslides.pdfk-means network intrusion data set ( > 4 million data points)...
TRANSCRIPT
![Page 1: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/1.jpg)
k-means
• Gaussian mixture model
• Maximize the likelihood
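$\{x_1, x_2, \ldots, x_n\}$
Centers: $c_1, c_2, \ldots, c_k$
$$P(x_i \mid c_j, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\,\|x_i - c_j\|^2\right)$$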
![Page 2: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/2.jpg)
k-means
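$$P(x_i \mid c_j, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\,\|x_i - c_j\|^2\right)$$
Minimize $\|x_i - c_j\|^2$
Sum of squared errors (SSE) criterion (k clusters and n samples):
$$\min \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - c_j\|^2$$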
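The SSE criterion is usually minimized with Lloyd's alternating heuristic: assign each point to its nearest center, then move each center to the mean of its points. The following is a minimal NumPy sketch under that assumption; the function name `kmeans`, the `init` parameter and the random initialisation are illustrative choices, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, init=None, n_iter=100, seed=0):
    """Lloyd's algorithm: a common heuristic for lowering the SSE criterion."""
    X = np.asarray(X, dtype=float)
    if init is None:
        # Assumption: initialise from k distinct data points chosen at random.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centers = np.asarray(init, dtype=float).copy()

    def assign(c):
        # Squared Euclidean distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    labels = assign(centers)
    sse = float(((X - centers[labels]) ** 2).sum())
    return centers, labels, sse
```

Each of the two steps can only lower or keep the SSE, so the loop converges, though only to a local minimum that depends on the initial centers.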
![Page 3: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/3.jpg)
k-means
k-means works well when clusters are “linearly separable” and spherical
![Page 5: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/5.jpg)
k-means
SSE criterion doesn’t always work
![Page 6: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/6.jpg)
k-means
What about data which contain arbitrarily shaped clusters of different densities?
![Page 7: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/7.jpg)
The Kernel Trick Revisited
![Page 8: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/8.jpg)
The Kernel Trick Revisited
Map points to “feature space” using basis function $\varphi(x)$
Replace dot product $\varphi(x) \cdot \varphi(y)$ (for similarity computation between points x and y) with kernel entry $K(x, y)$
Mercer’s condition: To expand kernel function $K(x, y)$ into a dot product, i.e. $K(x, y) = \varphi(x) \cdot \varphi(y)$, $K(x, y)$ has to be a positive semi-definite function, i.e., for any function $f(x)$ whose $\int f(x)^2\,dx$ is finite, the following inequality holds:
$$\iint f(x)\,K(x, y)\,f(y)\,dx\,dy \ge 0$$
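On a finite sample, Mercer's condition shows up as a kernel matrix with no negative eigenvalues. The sketch below checks this numerically; the RBF kernel, the `gamma` parameter and both function names are assumptions made for illustration, since the slides do not name a kernel at this point.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Kernel matrix with entries exp(-gamma * ||x_i - x_j||^2) (assumed kernel)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative round-off

def is_positive_semidefinite(K, tol=1e-8):
    """True if the symmetrised matrix has no eigenvalue below -tol."""
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)
    return bool(eigvals.min() >= -tol)
```

A symmetric matrix such as `[[0, 1], [1, 0]]` fails the check (its eigenvalues are 1 and -1), so it cannot be the kernel matrix of any feature map.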
![Page 10: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/10.jpg)
Kernel k-means
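Minimize sum of squared error:
k-means:
$$\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\|x_i - c_j\|^2, \qquad u_{ij} \in \{0, 1\}, \qquad \sum_{j=1}^{k} u_{ij} = 1$$
Replace $x_i$ with $\varphi(x_i)$:
$$\min \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\|\varphi(x_i) - \tilde{c}_j\|^2$$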
![Page 11: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/11.jpg)
Kernel k-means
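Cluster centers:
$$\tilde{c}_j = \frac{1}{n_j} \sum_{i=1}^{n} u_{ij}\,\varphi(x_i)$$
Substitute for centers:
$$\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\|\varphi(x_i) - \tilde{c}_j\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\left\|\varphi(x_i) - \frac{1}{n_j} \sum_{l=1}^{n} u_{lj}\,\varphi(x_l)\right\|^2$$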
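Expanding the squared norm makes it clear why the feature map never has to be computed explicitly. The following worked step assumes $K(x, y) = \varphi(x) \cdot \varphi(y)$ and $n_j = \sum_{i=1}^{n} u_{ij}$ (the number of points in cluster $j$); the summation index $p$ is introduced here for illustration.

$$\left\|\varphi(x_i) - \frac{1}{n_j} \sum_{l=1}^{n} u_{lj}\,\varphi(x_l)\right\|^2 = K(x_i, x_i) - \frac{2}{n_j} \sum_{l=1}^{n} u_{lj}\,K(x_i, x_l) + \frac{1}{n_j^2} \sum_{l=1}^{n} \sum_{p=1}^{n} u_{lj}\,u_{pj}\,K(x_l, x_p)$$

Every term on the right is an entry of the kernel matrix, so point-to-center distances in feature space can be evaluated from kernel entries alone.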
![Page 12: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/12.jpg)
Kernel k-means
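• Use kernel trick:
$$\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}\,\|\varphi(x_i) - \tilde{c}_j\|^2 = \operatorname{trace}(K) - \operatorname{trace}(U K U')$$
• Optimization problem:
$$\min_{U}\ \operatorname{trace}(K) - \operatorname{trace}(U K U') \;\equiv\; \max_{U}\ \operatorname{trace}(U K U')$$
• $K$ is the $n \times n$ kernel matrix, $U$ is the optimal normalized cluster membership matrix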
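The trace form can be checked numerically. The sketch below assumes $U$ is a $k \times n$ matrix with entry $1/\sqrt{n_j}$ where point $i$ belongs to cluster $j$ and 0 elsewhere, which is one standard reading of "normalized cluster membership matrix"; the function names are illustrative.

```python
import numpy as np

def normalized_membership(labels, k):
    """U (k x n) with U[j, i] = 1/sqrt(n_j) if point i is in cluster j, else 0."""
    labels = np.asarray(labels)
    n = len(labels)
    U = np.zeros((k, n))
    U[labels, np.arange(n)] = 1.0
    counts = U.sum(axis=1, keepdims=True)
    return U / np.sqrt(np.maximum(counts, 1.0))  # guard against empty clusters

def kernel_sse(K, labels, k):
    """Clustering error in feature space computed from the kernel matrix only."""
    U = normalized_membership(labels, k)
    return float(np.trace(K) - np.trace(U @ K @ U.T))
```

With the linear kernel `K = X @ X.T` the feature map is the identity, so `kernel_sse` should agree with the ordinary SSE computed from the cluster means.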
![Page 14: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/14.jpg)
Example
$k = 2$
(figure: k-means clusters, axes $x_1$ and $x_2$)
![Page 15: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/15.jpg)
Example
(figure: axes $x_1$ and $x_2$)
![Page 16: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/16.jpg)
Example
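Polynomial kernel: $K(x, y) = (x'y)^2$
$$\varphi(x_1, x_2) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$$
$$z_1 = x_1^2, \qquad z_2 = \sqrt{2}\,x_1 x_2, \qquad z_3 = x_2^2$$
(figure: input-space axes $x_1$, $x_2$ and feature-space axes $z_1$, $z_2$, $z_3$)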
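The identity behind this example, that the degree-2 polynomial kernel equals a dot product after the three-dimensional mapping, is easy to confirm numerically. A small sketch, with `poly_kernel` and `phi` as illustrative names:

```python
import numpy as np

def poly_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x'y)^2."""
    return float(np.dot(x, y) ** 2)

def phi(x):
    """Explicit feature map for 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])
```

For any two 2-D points, `poly_kernel(x, y)` and `np.dot(phi(x), phi(y))` agree up to round-off. In the mapped space a point at distance $r$ from the origin satisfies $z_1 + z_3 = r^2$, so concentric rings in the input space fall on parallel planes and can be split by a plane.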
![Page 19: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/19.jpg)
k-means vs. Kernel k-means
(figure panels: k-means and kernel k-means, $k = 2$)
![Page 20: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/20.jpg)
Performance of Kernel k-means
Evaluation of the performance of clustering algorithms in kernel-induced feature space, Pattern Recognition, 2005
![Page 22: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/22.jpg)
Limitations of Kernel k-means
• More complex than k-means
• Need to compute and store n x n kernel matrix
• Appropriate kernel function has to be determined
• Largest n that can be handled?
• Intel Xeon E7-8837 Processor (Q2’11), eight-core, 2.8 GHz, 4 TB max memory
• < 1 million points with single-precision numbers
• Computing the kernel matrix alone may take several days
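The memory limit follows from simple arithmetic: a dense kernel matrix has $n^2$ entries. The sketch below assumes 4 bytes per single-precision entry and decimal terabytes ($10^{12}$ bytes); the function names are illustrative.

```python
def kernel_matrix_bytes(n, bytes_per_entry=4):
    """Bytes needed to store a dense n x n kernel matrix (4 bytes = single precision)."""
    return n * n * bytes_per_entry

def to_terabytes(num_bytes):
    """Convert bytes to decimal terabytes (10^12 bytes)."""
    return num_bytes / 1e12
```

Under these assumptions, one million points need 4 TB, which is the full memory of the machine above, and four million points need 64 TB.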
![Page 23: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/23.jpg)
“Big data” Volume* – Big data comes in one size: large
*Defn. due to IBM
![Page 24: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/24.jpg)
Data Volume
| Application | Clustering task | Size of data | Number of features |
|---|---|---|---|
| Document retrieval | Group documents of similar topics | $10^9$ | $10^4$ |
| Gene analysis | Group genes with similar expression levels | $10^6$ | $10^2$ |
| Image retrieval | Quantize low-level features | $10^9$ | $10^2$ |
| Earth science data analysis | Derive climate indices | $10^5$ | $10^2$ |
![Page 25: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/25.jpg)
“Big data” Velocity – Often time-sensitive, big data must be used as it is streaming
![Page 26: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/26.jpg)
“Big data” Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more
![Page 27: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/27.jpg)
Large Scale Clustering
Deals with the first issue related to big data – the volume of data
Issues:
Computational Complexity
Hardware Limitations
Application Requirements
![Page 28: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/28.jpg)
MapReduce Framework
![Page 30: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/30.jpg)
How to distribute k-means?
Two methods
• Distribute distance computation
![Page 31: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/31.jpg)
k-means Clustering with MapReduce - I
Distribute the cost of distance computation
Cluster centers maintained in global memory
Divide points among map tasks
Parallel k-means clustering based on MapReduce, Cloud computing, 2009
![Page 32: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/32.jpg)
k-means Clustering with MapReduce - I
Map function
Find the closest center for data point
Intermediate output: Closest cluster index
Combine function
Partially sum the values of the points assigned to the same cluster, keep track of number of points in the cluster
Reduce function
Compute new centers from the output of combine function
Parallel k-means clustering based on MapReduce, Cloud computing, 2009
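One iteration of this scheme can be simulated in a single process to make the three roles concrete. This is a sketch of the map, combine and reduce steps as listed above, not the cited paper's implementation; all function names and the `n_tasks` parameter are illustrative, and a real job would run on a MapReduce runtime rather than in a Python loop.

```python
import numpy as np

def map_point(point, centers):
    """Map: emit (index of closest center, (point, 1))."""
    d2 = ((centers - point) ** 2).sum(axis=1)
    return int(d2.argmin()), (point, 1)

def combine(mapped):
    """Combine: partial sum and count per cluster within one map task."""
    partial = {}
    for j, (p, c) in mapped:
        s, n = partial.get(j, (0.0, 0))
        partial[j] = (s + p, n + c)
    return partial

def reduce_centers(partials, centers):
    """Reduce: merge partial sums from all map tasks into new centers."""
    new_centers = np.array(centers, dtype=float)  # copy; empty clusters keep old center
    totals = {}
    for partial in partials:
        for j, (s, n) in partial.items():
            ts, tn = totals.get(j, (0.0, 0))
            totals[j] = (ts + s, tn + n)
    for j, (s, n) in totals.items():
        new_centers[j] = s / n
    return new_centers

def kmeans_iteration(X, centers, n_tasks=4):
    """Split the points among map tasks and compute the updated centers."""
    splits = np.array_split(X, n_tasks)
    partials = [combine(map_point(x, centers) for x in split) for split in splits]
    return reduce_centers(partials, centers)
```

The combine step is what keeps communication small: each map task sends at most k (sum, count) pairs to the reducer instead of one record per point.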
![Page 34: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/34.jpg)
How to distribute k-means?
Two methods
• Distribute distance computation
• Distribute clustering task
![Page 35: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/35.jpg)
k-means Clustering with MapReduce - II
Distribute the cost of clustering
Map function
Cluster the partition into k clusters
Intermediate output: Clusters of the partition
Reduce function
Cluster the cluster centers from the map output to obtain the new centers
Fast clustering using MapReduce, KDD, 2011
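The partition-then-merge idea can be sketched in a few lines. This is a simplified single-process illustration of the map and reduce roles described above, not the algorithm of the cited paper; the helper `lloyd`, the random partitioning and all parameter names are assumptions.

```python
import numpy as np

def lloyd(X, k, n_iter=50, seed=0):
    """Plain k-means (Lloyd's heuristic), used for both the map and reduce steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def map_cluster_partition(partition, k, seed):
    """Map: cluster one partition into k clusters and emit its centers."""
    return lloyd(partition, k, seed=seed)

def reduce_cluster_centers(center_sets, k, seed=0):
    """Reduce: cluster the collected centers to obtain the final k centers."""
    return lloyd(np.vstack(center_sets), k, seed=seed)

def distributed_kmeans(X, k, n_partitions=4, seed=0):
    """Shuffle, split into partitions, cluster each, then cluster the centers."""
    rng = np.random.default_rng(seed)
    partitions = np.array_split(X[rng.permutation(len(X))], n_partitions)
    center_sets = [map_cluster_partition(p, k, seed + i)
                   for i, p in enumerate(partitions)]
    return reduce_cluster_centers(center_sets, k, seed)
```

Each map task sees only its own partition and no shared state, and the reducer works on `n_partitions * k` points rather than on the full data set.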
![Page 36: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/36.jpg)
k-means Clustering with MapReduce - II
No global storage required
Approximate solution
Clustering error (SSE) < 2 * optimal clustering error
Fast clustering using MapReduce, KDD, 2011
![Page 37: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/37.jpg)
Machine Learning on MapReduce
Mahout – scalable implementation of major clustering and classification algorithms on Hadoop
Open source
Java and Maven based
![Page 38: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/38.jpg)
Large Scale Kernel Clustering
Data set with $n$ points: $n \times n$ kernel matrix $K$
When $n \sim 10^6$, more than 1 TB of memory is required, and the computation is highly expensive
![Page 39: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/39.jpg)
Approximate Kernel k-means
Low rank approximation
Use a small portion of the kernel matrix for clustering.
The (n-m) x (n-m) chunk of the kernel matrix need not be computed
$$\underbrace{K}_{n \times n} \approx \underbrace{K_B}_{n \times m}\ \underbrace{\hat{K}^{-1}}_{m \times m}\ \underbrace{K_B'}_{m \times n}$$
Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD, 2011
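The low-rank factorization can be illustrated with a small NumPy sketch. It assumes an RBF kernel, uniform random sampling of the m points, and a pseudo-inverse in place of the plain inverse for numerical safety; these choices and the function names are illustrative, not taken from the cited paper.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel between rows of A and rows of B (assumed kernel)."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def low_rank_kernel(X, m, gamma=0.5, seed=0):
    """Compute only the n x m and m x m blocks for m randomly sampled points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    K_B = rbf(X, X[idx], gamma)         # n x m: all points vs. sampled points
    K_hat = rbf(X[idx], X[idx], gamma)  # m x m: sampled points vs. themselves
    return K_B, K_hat, idx

def reconstruct(K_B, K_hat):
    """Form the n x n approximation K_B K_hat^+ K_B' (only sensible for small n)."""
    return K_B @ np.linalg.pinv(K_hat) @ K_B.T
```

Only `n * m + m * m` kernel entries are ever evaluated; `reconstruct` is included to inspect the approximation on toy data and would defeat the purpose at large n. When m equals n the reconstruction recovers the full kernel matrix.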
![Page 40: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/40.jpg)
Approximate Kernel k-means
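Cluster centers – linear combination of sampled points
$$c_j = \sum_{i=1}^{m} \alpha_{ij}\,\varphi(\hat{x}_i)$$
Approximation error: the clustering error is bounded relative to the optimal clustering error, with the bound tightening as $m$ grows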
Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, KDD, 2011
![Page 41: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/41.jpg)
Approximate Kernel k-means
![Page 42: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/42.jpg)
Performance of Approximate Kernel k-means
![Page 43: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/43.jpg)
Performance of Approximate Kernel k-means
MNIST data set (70,000 data points)
| Method | Kernel calculation | Clustering |
|---|---|---|
| Kernel k-means | 514 seconds | 3953 seconds |
| Approximate kernel k-means (m=1000) | 8 seconds | 75 seconds |
About 98% reduction in time
Almost the same clustering error as kernel k-means
![Page 44: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/44.jpg)
Performance of Approximate Kernel k-means
Network Intrusion data set ( > 4 million data points)
• Kernel k-means not possible on a “normal” system
• Requires 64 TB of memory
• Approximate kernel k-means with just 40 GB memory
| Method | Kernel calculation | Clustering |
|---|---|---|
| Approximate kernel k-means (m=1000) | 52 seconds | 433 seconds |
![Page 45: k-meanscse802/clusteringSlides.pdfk-means Network Intrusion data set ( > 4 million data points) • Kernel k-means not possible on a “normal” system • Requires 64 TB of memory](https://reader036.vdocuments.us/reader036/viewer/2022063004/5f86e681f21e2c605c6e5aab/html5/thumbnails/45.jpg)
Summary
• Kernel k-means
• Performs better than k-means when clusters are not linearly separable
• Kernel clustering algorithms are, in general, more complex than linear clustering algorithms
• Large scale clustering
• Distributed and approximate variants of existing algorithms required for clustering large data