spiros papadimitriou jimeng sun ibm t.j. watson research center hawthorne, ny, usa reporter:...

DISCO: DISTRIBUTED CO-

CLUSTERING WITH MAP-

REDUCESpiros Papadimitriou Jimeng Sun

IBM T.J. Watson Research Center

Hawthorne, NY, USA

Reporter: Nai-Hui, Ku

OUTLINE Introduction Related Work Distributed Mining Process Co-clustering Huge Datasets Experiments Conclusions

INTRODUCTION Problems

Huge datasetsNatural sources of data are impure form

Proposed MethodA comprehensive Distributed Co-clustering

(DisCo) solutionUsing HadoopDisCo is a scalable framework under which

various co-clustering algorithms can be implemented

RELATED WORK Map-Reduce framework

employs a distributed storage clusterblock-addressable storagea centralized metadata servera convenient data accessstorage API for Map-Reduce tasks

RELATED WORK Co-clustering

Algorithm cluster shapes

checkerboard partitionssingle bi-clusterExclusive row and column partitionsoverlapping partitions

Optimization criteria code length

DISTRIBUTED MINING PROCESS

Identifying the source and obtaining the data

Transform raw data into the appropriate format

for data analysis

Visual results, or turned into the input for other applications.

DISTRIBUTED MINING PROCESS (CONT.) Data pre-processing

Processing 350 GB raw network event log Needs over 5 hours to extract source/destination

IP pairsAchieve much better performance on a few

commodity nodes running HadoopSetting up Hadoop required minimal effort

DISTRIBUTED MINING PROCESS (CONT.)

Specifically for co-clustering, there are two main preprocessing tasks: Building the graph from raw data Pre-computing the transpose

During co-clustering optimization, we need to iterate over both rows and columns.

Need to pre-compute the adjacency lists for both the original graph as well as its transpose

CO-CLUSTERING HUGE DATASETS Definitions and overview

Matrices are denoted by boldface capital lettersVectors are denoted by boldface lowercase

lettersaij:the (i, j)-th element of matrix ACo-clustering algorithms employs a

checkerboard the original adjacency matrix a grid of sub-matrices

An m x n matrix, a co-clustering is a pair of row and column labeling vectors

r(i):the i-th row of the matrixG: the k×ℓ group matrix

A

a

CO-CLUSTERING HUGE DATASETS (CONT.)

gpq gives the sufficient statistics for the (p, q) sub-matrix

CO-CLUSTERING HUGE DATASETS (CONT.) Map function

CO-CLUSTERING HUGE DATASETS (CONT.) Reduce function

CO-CLUSTERING HUGE DATASETS (CONT.) Global sync

EXPERIMENTS Setup

39 nodes Two dual-core processors 8GM RAM Linux RHEL4 4Gbps Ethernets SATA, 65MB/sec or roughly 500 Mbps The total capacity of our HDFS cluster was just

2.4 terabytes HDFS block size was set to 64MB (default

value) JAVA Sun JDK version 1.6.0_03

EXPERIMENTS (CONT.) The pre-processing step on the ISS data Default values

39 nodes6 concurrent maps per node5 reduce tasks256MB input split size

EXPERIMENTS (CONT.)

CONCLUSIONS Using relatively low-cost components

I/O rates that exceed those of high-performance storage systems.

Performance scales almost linearly with the number of machines/disks.

spiros papadimitriou jimeng sun ibm t.j. watson research center hawthorne, ny, usa reporter:...

Documents

pragmatic data mining

datatransform raw data

data preprocessingprocessing

data analysisvisual

huge datasetsdefinitions

th row group

ith row

q submatrix