spiros papadimitriou jimeng sun ibm t.j. watson research center hawthorne, ny, usa reporter:...
TRANSCRIPT
DISCO: DISTRIBUTED CO-
CLUSTERING WITH MAP-
REDUCESpiros Papadimitriou Jimeng Sun
IBM T.J. Watson Research Center
Hawthorne, NY, USA
Reporter: Nai-Hui, Ku
OUTLINE Introduction Related Work Distributed Mining Process Co-clustering Huge Datasets Experiments Conclusions
INTRODUCTION Problems
Huge datasetsNatural sources of data are impure form
Proposed MethodA comprehensive Distributed Co-clustering
(DisCo) solutionUsing HadoopDisCo is a scalable framework under which
various co-clustering algorithms can be implemented
RELATED WORK Map-Reduce framework
employs a distributed storage clusterblock-addressable storagea centralized metadata servera convenient data accessstorage API for Map-Reduce tasks
RELATED WORK Co-clustering
Algorithm cluster shapes
checkerboard partitionssingle bi-clusterExclusive row and column partitionsoverlapping partitions
Optimization criteria code length
DISTRIBUTED MINING PROCESS
Identifying the source and obtaining the data
Transform raw data into the appropriate format
for data analysis
Visual results, or turned into the input for other applications.
DISTRIBUTED MINING PROCESS (CONT.) Data pre-processing
Processing 350 GB raw network event log Needs over 5 hours to extract source/destination
IP pairsAchieve much better performance on a few
commodity nodes running HadoopSetting up Hadoop required minimal effort
DISTRIBUTED MINING PROCESS (CONT.)
Specifically for co-clustering, there are two main preprocessing tasks: Building the graph from raw data Pre-computing the transpose
During co-clustering optimization, we need to iterate over both rows and columns.
Need to pre-compute the adjacency lists for both the original graph as well as its transpose
CO-CLUSTERING HUGE DATASETS Definitions and overview
Matrices are denoted by boldface capital lettersVectors are denoted by boldface lowercase
lettersaij:the (i, j)-th element of matrix ACo-clustering algorithms employs a
checkerboard the original adjacency matrix a grid of sub-matrices
An m x n matrix, a co-clustering is a pair of row and column labeling vectors
r(i):the i-th row of the matrixG: the k×ℓ group matrix
A
a
EXPERIMENTS Setup
39 nodes Two dual-core processors 8GM RAM Linux RHEL4 4Gbps Ethernets SATA, 65MB/sec or roughly 500 Mbps The total capacity of our HDFS cluster was just
2.4 terabytes HDFS block size was set to 64MB (default
value) JAVA Sun JDK version 1.6.0_03
EXPERIMENTS (CONT.) The pre-processing step on the ISS data Default values
39 nodes6 concurrent maps per node5 reduce tasks256MB input split size