CO-CLUSTERING USING CUDA
Co-Clustering Explained
Problem: given a large binary matrix of samples (rows) and features (columns), which samples should be grouped together? Why? What features do they share?
Co-clustering provides the "why" explicitly: correlated sample/feature pairs
Example row cluster: s1 and s3 are in a group
Example column cluster: the distinguishing features are 2, 3, and 5
Co-Clustering - Details
Uses Information Theoretic Co-clustering, as parallelized for the Hadoop architecture in: "Disco: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining," Papadimitriou et al., ICDM 2008
Partition the entire matrix into row groups and column groups
Minimize the length of the encoding of the resulting partitioned matrix
Competing code-length factors: number of row groups and column groups vs. homogeneity of the clusters
Iterate over rows, rearranging and sub-partitioning to find a better encoding using a heuristic
Repeat for columns, then rows again, until a local optimum is found
Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)
Algorithm credit: Chakrabarti et al., KDD 2004
Implementation - Basics
Initial matrix generation: CPU
Initial random row/column group assignment: CPU
Memory structures are very simple: arrays of ints (a minimal sketch follows)
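A minimal sketch of this setup, assuming a flat row-major int matrix and simple assignment arrays (all identifiers here are illustrative, not the project's actual names):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Copy the CPU-generated binary matrix (n x m, row-major ints) to the GPU.
int *makeDeviceMatrix(const int *h_matrix, int n, int m)
{
    int *d_matrix = nullptr;
    cudaMalloc(&d_matrix, (size_t)n * m * sizeof(int));
    cudaMemcpy(d_matrix, h_matrix, (size_t)n * m * sizeof(int),
               cudaMemcpyHostToDevice);
    return d_matrix;
}

// CPU-side random initial assignment of rows (or columns) to groups.
int *makeRandomGroups(int count, int numGroups)
{
    int *groups = (int *)malloc(count * sizeof(int));
    for (int i = 0; i < count; ++i)
        groups[i] = rand() % numGroups;
    return groups;
}
```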
Implementation – Stats step 1
Statistics calculations: compute a statistic for each row within each column group
The statistic is the number of 1's the row has in that column group
Straightforward parallelization (each thread works on one row at a time), using global memory (sketch after the figure below)
[Figure: example matrix partitioned into row groups and column groups; Stat(Row 3, ColumnGroup 3) = 1]
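A minimal sketch of this step-1 kernel, assuming the layout above (one thread per row, counts accumulated directly in global memory; row_stats must be zeroed beforehand, e.g. with cudaMemset):

```cuda
// Count the 1's each row has in each column group (one thread per row).
__global__ void rowColGroupCounts(const int *matrix,    // n x m binary matrix, row-major
                                  const int *col_group, // column -> column group
                                  int *row_stats,       // n x num_col_groups counts
                                  int n, int m, int num_col_groups)
{
    // Grid-stride loop so any block/thread configuration covers all rows.
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < n;
         row += gridDim.x * blockDim.x) {
        for (int col = 0; col < m; ++col)
            if (matrix[row * m + col])
                row_stats[row * num_col_groups + col_group[col]] += 1;
    }
}
```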
Room For Improvement
Calculate row statistics with the textbook histogram algorithm (see the sketch below):
Block the columns and assign one thread block to each column block
Compute shared-memory histograms within each block
Merge back to global memory when finished
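A sketch of that shared-memory histogram variant, assuming one grid row per matrix row and column tiles spread across blockIdx.x (launch with num_col_groups * sizeof(int) of dynamic shared memory; names are illustrative):

```cuda
__global__ void rowColGroupCountsShared(const int *matrix, const int *col_group,
                                        int *row_stats, int m, int num_col_groups)
{
    extern __shared__ int hist[];          // one counter per column group
    int row = blockIdx.y;                  // this block's row

    // Zero the shared-memory histogram.
    for (int g = threadIdx.x; g < num_col_groups; g += blockDim.x)
        hist[g] = 0;
    __syncthreads();

    // Blocks along x each cover a slice of the columns for this row.
    for (int col = blockIdx.x * blockDim.x + threadIdx.x; col < m;
         col += gridDim.x * blockDim.x)
        if (matrix[row * m + col])
            atomicAdd(&hist[col_group[col]], 1);   // shared-memory atomic
    __syncthreads();

    // Merge this block's partial histogram into global memory.
    for (int g = threadIdx.x; g < num_col_groups; g += blockDim.x)
        if (hist[g])
            atomicAdd(&row_stats[row * num_col_groups + g], hist[g]);
}
```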
Implementation – Stats step 2
Calculates the cost for each row group within each column group
Essentially a reduce over the per-row data
Block the rows and assign each row block to a thread block
Use shared memory and atomics to build a histogram of all rows in a given row group
Merge the shared histogram with the global histogram for that row group
Iterate over all row groups (sketch after the figure below)
[Figure: example matrix partitioned into row groups and column groups; Stat(RowGroup 1, ColumnGroup 3) = 2]
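A sketch of this step-2 reduction under the same assumed layout. For brevity it accumulates all row groups in a single pass rather than iterating per row group; the shared-memory histogram needs num_row_groups * num_col_groups * sizeof(int) of dynamic shared memory:

```cuda
// Reduce per-row counts into per-(row group, column group) counts.
__global__ void rowGroupColGroupCounts(const int *row_stats,  // n x num_col_groups
                                       const int *row_group,  // row -> row group
                                       int *group_stats,      // num_row_groups x num_col_groups
                                       int n, int num_col_groups, int num_row_groups)
{
    extern __shared__ int hist[];
    int cells = num_row_groups * num_col_groups;

    for (int i = threadIdx.x; i < cells; i += blockDim.x)
        hist[i] = 0;
    __syncthreads();

    // Each block handles a stride of rows; accumulate with shared-memory atomics.
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < n;
         row += gridDim.x * blockDim.x) {
        int rg = row_group[row];
        for (int g = 0; g < num_col_groups; ++g)
            atomicAdd(&hist[rg * num_col_groups + g],
                      row_stats[row * num_col_groups + g]);
    }
    __syncthreads();

    // Merge the block's partial totals into the global histogram.
    for (int i = threadIdx.x; i < cells; i += blockDim.x)
        if (hist[i]) atomicAdd(&group_stats[i], hist[i]);
}
```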
Implementation – Row/Col Group Optimization
For each row, find the optimal group it could belong to (sketch below)
Parallelized straightforwardly: one row per thread, with a grid-stride loop to cover all rows
Each row's calculation goes through every row group and determines the global cost of moving the row to that group
Move all rows to their optimal groups, then recompute statistics
Repeat for column groups, and continue alternating row/column groupings until convergence
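A sketch of the reassignment kernel. The cost function below is only a stand-in (a squared difference between row and group counts); the real algorithm scores the change in total code length:

```cuda
// Placeholder score of how well a row fits a row group's statistics;
// stands in for the information-theoretic cost used by the real algorithm.
__device__ float rowCostInGroup(const int *row_stats, const int *group_stats,
                                int row, int rg, int num_col_groups)
{
    float cost = 0.f;
    for (int g = 0; g < num_col_groups; ++g) {
        float diff = (float)row_stats[row * num_col_groups + g]
                   - (float)group_stats[rg * num_col_groups + g];
        cost += diff * diff;
    }
    return cost;
}

// One row per thread: try every row group and keep the cheapest.
__global__ void assignBestRowGroup(const int *row_stats, const int *group_stats,
                                   int *row_group, int n,
                                   int num_row_groups, int num_col_groups)
{
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < n;
         row += gridDim.x * blockDim.x) {
        int best = row_group[row];
        float bestCost = rowCostInGroup(row_stats, group_stats, row, best, num_col_groups);
        for (int rg = 0; rg < num_row_groups; ++rg) {
            float c = rowCostInGroup(row_stats, group_stats, row, rg, num_col_groups);
            if (c < bestCost) { bestCost = c; best = rg; }
        }
        row_group[row] = best;   // statistics are recomputed after this kernel
    }
}
```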
Room For Improvement
The parallelization could be more sophisticated
Could block the rows and compute the cost of a row joining each row group in parallel, using shared-memory atomics to identify the minimum cost
In practice, the algorithm heavily favors a small number of row and column groups, so the extra parallelism would be small
Implementation - Outer Loop
After a local minimum is found, change the initial number of row and column groups and retry
Change the number of row groups or the number of column groups, up or down
Continue changing in that direction until the cost fails to decrease
Try both directions in both dimensions before stopping
Outer loop performed on the CPU (host-side sketch below)
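A host-side sketch of this outer loop, assuming a hypothetical coclusterCost(numRowGroups, numColGroups) wrapper (not shown) that runs the GPU inner loop to convergence and returns the resulting code length:

```cuda
double coclusterCost(int numRowGroups, int numColGroups);  // assumed wrapper, not shown

// Hill-climb on the number of row/column groups: try each dimension in each
// direction and keep moving while the cost keeps decreasing.
void outerSearch(int &numRowGroups, int &numColGroups)
{
    double best = coclusterCost(numRowGroups, numColGroups);
    const int step[2][2] = { {1, 0}, {0, 1} };   // adjust row groups, then column groups
    for (int d = 0; d < 2; ++d) {
        for (int dir = -1; dir <= 1; dir += 2) {
            while (true) {
                int r = numRowGroups + dir * step[d][0];
                int c = numColGroups + dir * step[d][1];
                if (r < 1 || c < 1) break;
                double cost = coclusterCost(r, c);
                if (cost >= best) break;         // stop once cost fails to decrease
                best = cost;
                numRowGroups = r;
                numColGroups = c;
            }
        }
    }
}
```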
Room for Improvement
The outer loop could distribute the inner-loop work across different GPUs
Each GPU could explore a different dimension and direction in parallel
Implementation – CPU + Validation
The CPU implementation performed all the steps described earlier, but sequentially
Validation:
Used the CPU statistics calculations to validate the GPU statistics calculations
CPU and GPU log implementations differ, so cost calculations were validated by allowing a 5% tolerance between results (see the check sketched below)
Did not have time to validate the overall algorithm or visualize its outputs to check whether the co-clusters produced were reasonable
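The 5% comparison amounts to a relative-tolerance check along these lines (names are illustrative):

```cuda
#include <cmath>

// Accept the GPU cost if it is within 5% of the CPU reference cost.
bool costsAgree(double cpuCost, double gpuCost, double tolerance = 0.05)
{
    return std::fabs(cpuCost - gpuCost) <= tolerance * std::fabs(cpuCost);
}
```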
Timing Measurements
CPU implementation: time measured with clock_t / CLOCKS_PER_SEC
GPU implementation: time measured with CUDA events (sketch below)
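A sketch of the two timing approaches (the kernel call is a placeholder):

```cuda
#include <ctime>
#include <cuda_runtime.h>

void timeBothImplementations()
{
    // CPU timing: wall-clock seconds via clock_t / CLOCKS_PER_SEC.
    clock_t t0 = clock();
    // ... run the sequential co-clustering ...
    double cpuSeconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

    // GPU timing: milliseconds between two CUDA events around the kernel launch.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    // rowColGroupCounts<<<numBlocks, numThreads>>>(...);   // kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpuMilliseconds = 0.f;
    cudaEventElapsedTime(&gpuMilliseconds, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    (void)cpuSeconds;
}
```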
Development Lessons Learned
CUDA and structured data are a bad combination; even structs of arrays are painful to deal with
Host-side pointer math on device pointers does not work
The CUDA API has very unfriendly error messages; take care to do as little as possible directly through that API (a checked-call macro like the sketch after this list helps)
__device__ variables declared globally must be passed to kernels; runtime errors otherwise
You can malloc and free device memory in device code as of CUDA 3.2
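One mitigation is a checked-call macro (the CUDA_CHECK name is ours, not the project's) so every runtime API failure reports a readable message and location:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Example: CUDA_CHECK(cudaMalloc(&d_matrix, (size_t)n * m * sizeof(int)));
```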
Development Lessons Learned (cont.)
Visual Studio CUDA integration leaves a lot to be desired
Even with all optimizations removed, breakpoints still can't be set everywhere
Many variables show as freed
No in-IDE, real-time compile errors in the editor
But Visual Studio does give nice auto-complete and go-to-definition navigation
No CUDA linker, so separate files must be #include'd directly
Experiment - Environment
Machine: Float.cs.drexel.edu
CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
GPU: 2 NVIDIA GeForce GTX 580 GPUs @ 1544 MHz
Experiment - Description
Sequential (CPU) and parallel (GPU) versions tested on square matrices of order 100, 1000, and 10000; larger matrices caused memory problems
GPU tested with varying block and thread counts: num blocks = 10, 100, 5000; num threads = 10, 100, 1024 (max)
Resulting co-clusterings usually stayed in the 50-200 row/column group range, regardless of matrix order
Row and column groupings matter in the calculation of matrix statistics, since rows and columns are blocked by them
Experiment Results
[Chart: Speedup with 10 blocks vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads]
Experiment Results
For a small number of blocks, 100-thread performance peaks at num_blocks * num_threads = matrix_order
I would expect this to be the optimal configuration, when num_blocks ~= num_row_groups ~= num_col_groups
Slowdown occurs when the matrix order exceeds the total number of threads, so more work must be done serially
Experiment - Results
[Chart: Speedup with 100 blocks vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads]
Experiment Results
[Chart: Speedup with 5000 blocks vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads]
Experiment Results
Interestingly, the maximum speedup was the same for all block counts
Roughly speaking, as long as num_blocks * num_threads >= matrix_order, the maximum speedup of ~70x is achieved
10 threads never got there; perhaps due to block-scheduling overhead, or perhaps the cost of copying to shared memory for block processing was not recouped in the 10-thread case
Maxing out the thread count is counterproductive on smaller matrices
Hypothesis: when the block count is excessive (as for small matrices), scheduling large blocks of threads that return immediately is costly
Experiment Results
[Chart: Efficiency with 10 blocks vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads]
Experiment Results
[Chart: Efficiency with 100 blocks vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads]
Experiment Results
[Chart: Efficiency with 5000 blocks vs. matrix order (100, 1000, 10000) for 10, 100, and 1024 threads]
Experiment Results
Efficiency is consistently highest with smaller numbers of blocks and smaller numbers of threads per block
Hypothesis: the overhead of starting blocks and threads is high enough to produce diminishing returns as more blocks and threads are added