Statistical Analysis and Machine Learning Using Hadoop
Seungjai Min, Samsung SDS
TRANSCRIPT
Knowing that…
Hadoop/Map-Reduce has been successful in analyzing unstructured web content and social media data.
Another source of big data is semi-structured machine/device-generated logs, which require non-trivial data massaging and extensive statistical data mining.
Question
Is Hadoop/Map-Reduce the right framework to implement statistical analysis (more than counting and descriptive statistics) and machine learning algorithms (which involve iterations)?
Answer and Contents of this talk
Yes, Hadoop/Map-Reduce is the right framework
– Why is it better than MPI and CUDA?
– Map-Reduce Design Patterns
– Data Layout Patterns
No, but there are better alternatives
– Spark/Shark (as an example)
– R/Hadoop (neither RHadoop nor RHive)
Contents
Programming Models
– Map-Reduce vs. MPI vs. Spark vs. CUDA (GPU)
Map-Reduce Design Patterns
– Privatization Patterns (Summarization / Filtering / Clustering)
– Data Organization Patterns (Join / Transpose)
Data Layout Patterns
– Row vs. Column vs. BLOB
Summary
– How to choose the right programming model for your algorithm
Parallel Programming is Difficult
Too many parallel programming models (languages)
P-threads
OpenMP UPC
Co-array Fortran
Chapel
X10
MPI
Cilk
Fortress
RapidMind
Titanium
OpenCL
PVM
Intel TBB
CUDA
Brook
Erlang
MPI Framework
(Figure: an array of 400 elements block-distributed across four processes — elements 1–100, 101–200, 201–300, 301–400 — with neighboring processes exchanging boundary elements.)

myN = N / nprocs;
for (i = 0; i <= myN; i++) { A[i] = initialize(i); }

left_index = …;
right_index = …;

MPI_Send(pid-1, A[left_index], sizeof(int), …);
MPI_Recv(pid+1, A[right_index], sizeof(int), …);

for (i = 0; i <= myN; i++) { B[i] = (A[i] + A[i+1]) / 2.0; }

The assembly language of parallel programming
Map-Reduce Framework
Parallel programming for the masses!

(Figure: the Map-Reduce data flow — input splits feed parallel Map tasks; the Map/Combine/Partition stage emits key/value pairs; the Shuffle stage sorts and groups them by key; parallel Reduce tasks produce the outputs.)
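As a concrete illustration of the data flow above, here is a minimal sketch of the three stages in plain Python (a toy simulation, not Hadoop code; the word-count job and input records are invented for illustration):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) pairs from each input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each key."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
```

In a real cluster the map and reduce calls run in parallel on different machines; only the grouping-by-key contract is what the framework guarantees.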
Map-Reduce vs. MPI
Similarity
– Programming model
• Processes, not threads
• Address spaces are separate (data communications are explicit)
– Data locality
• The "owner computes" rule dictates that computations are sent to where the data is, not the other way round
Map-Reduce vs. MPI
Differences
– Expressing parallelism: Map-Reduce handles embarrassingly parallel forms (filter + reduction); MPI expresses almost all parallel forms, but is not good for task parallelism
– Data communication: under the hood in Map-Reduce; explicit / user-provided in MPI
– Data layout (locality): under the hood in Map-Reduce; explicit / user-controlled in MPI
– Fault tolerance: under the hood in Map-Reduce; none in MPI (as of MPI 1.0 and MPI 2.0)
GPU
GPGPU (General-Purpose Graphics Processing Units)
– 10~50 times faster than a CPU if an algorithm fits this model
– Good for embarrassingly parallel algorithms (e.g. image processing)
– Costs ($2K~$3.5K) and performance (two quad-cores vs. one GPU)
GPU
(Figure: a GPU with many cores, per-block local memories, shared memory, and a global memory, next to multi-core CPUs where each core has its own cache ($).)
Programming CUDA

cudaArray* cu_array;

// Allocate array
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &desc, width, height);

// Copy image data to array
cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_odata, width, height);

cudaUnbindTexture(tex);

__global__ void kernel(float* odata, int width, int height)
{
  unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
  float c = tex2D(tex, x, y);
  odata[y*width + x] = c;
}

Hard to program/debug → hard to find good engineers → hard to maintain code
Design Patterns in Parallel Programming
Privatization Idiom

sum = 0;
#pragma omp parallel private(p_sum)
{
  p_sum = 0;   // each thread initializes its own private copy
  #pragma omp for
  for (i = 1; i <= N; i++) {
    p_sum += A[i];
  }
  #pragma omp critical
  {
    sum += p_sum;
  }
}
Design Patterns in Parallel Programming
Reduction Idiom

#define N 400

#pragma omp parallel for
for (i = 1; i <= N; i++) {
  A[i] = 1;
}

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= N; i++) {
  sum += A[i];  // dependency on sum
}
printf("sum = %d\n", sum);
Design Patterns in Parallel Programming
Induction Idiom

// Before: x is an induction variable, carrying a loop dependency
x = K;
for (i = 0; i < N; i++) {
  A[i] = x++;
}

// After: compute x from i, removing the dependency
x = K;
for (i = 0; i < N; i++) {
  A[i] = x + i;
}
Design Patterns in Parallel Programming
Privatization Idiom → a perfect fit for the Map-Reduce framework

(The same privatization code as above: each thread's private p_sum accumulation corresponds to the Map tasks, and the critical-section combine corresponds to the Reduce task.)
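The correspondence can be sketched in a few lines of plain Python (a toy simulation, not Hadoop code; the three-way split is an arbitrary choice). Each simulated map task keeps a private partial sum, mirroring p_sum, and a final reduce step combines them, mirroring the critical section:

```python
# Sum A[1..N] with N = 400 and A[i] = 1, as in the reduction example.
data = [1] * 400

# Split the work across three simulated map tasks.
chunks = [data[i::3] for i in range(3)]

# Map/combine: each task accumulates its own private partial sum (p_sum).
partial_sums = [sum(chunk) for chunk in chunks]

# Reduce: combine the partial sums (the critical-section step).
total = sum(partial_sums)
```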
MapReduce Design Patterns
Book written by Donald Miner & Adam Shook
1. Summarization patterns
2. Filtering patterns
3. Data organization patterns
4. Join patterns
5. Meta-patterns
6. Input and output patterns
Design Patterns
Linear Regression (1-dimension)

(Figure: five observations (x1, y1) … (x5, y5) scattered around a fitted line, each with residual e1 … e5; in matrix form, the column vector of yi equals b times the column vector of xi plus the column vector of residuals.)

Y = bX + e
Design Patterns
Linear Regression (2-dimension)

(Figure: the m×1 vector Y equals the m×n matrix X — columns x1 and x2 holding x11 … x15 and x21 … x25 — times b, plus the residual vector e. m: # of observations, n: # of dimensions.)

Y = bX + e
Design Patterns
Linear Regression

b = (X^T X)^-1 X^T y, where X^T X (and its inverse) is an n×n matrix

If n^2 is sufficiently small → invert on a single node with the Apache math library
n should be kept small → avoid the curse of dimensionality
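In the one-dimensional case the pattern above reduces to partial sums: each map task can emit its block's sum of x*y and sum of x*x, the reduce step adds them, and inverting the 1×1 matrix X^T X is a single division. A sketch in plain Python (the data blocks are invented for illustration):

```python
# Each map task sees a block of (x, y) rows and emits the partial
# sufficient statistics (sum of x*y, sum of x*x).
def map_block(rows):
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, y in rows)
    return sxy, sxx

blocks = [[(1.0, 2.0), (2.0, 4.1)], [(3.0, 5.9), (4.0, 8.0)]]
partials = [map_block(b) for b in blocks]

# Reduce: add the partial statistics across blocks.
sxy = sum(p[0] for p in partials)
sxx = sum(p[1] for p in partials)

# The tiny "matrix inversion" of the 1x1 X^T X is just a division.
beta = sxy / sxx
```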
Design Patterns
Join

Table A:
ID   age  name  …
100  25   Bob   …
210  31   John  …
360  46   Kim   …

Table B:
ID   time  dst  …
100  7:28  CA   …
100  8:03  IN   …
210  4:26  WA   …

Inner join (A.ID = B.ID):
A.ID  A.age  A.name  B.ID  B.time  B.dst
100   25     Bob     100   7:28    CA
100   25     Bob     100   8:03    IN
210   31     John    210   4:26    WA
Design Patterns
Join — Reduce-side Join

(Figure: rows of both tables are read by Map tasks and shuffled by join key to the Reduce tasks, where the join is performed; every row — e.g. 100/25/Bob — travels over the network to its reducer, so the reduce-side join carries network overhead.)
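A minimal Python simulation of the reduce-side join above (toy records copied from the slide's tables; the "A"/"B" tag names are an arbitrary convention): the map phase tags each record with its source table, the shuffle groups records by ID, and each reducer crosses the two sides.

```python
from collections import defaultdict

users = [(100, (25, "Bob")), (210, (31, "John")), (360, (46, "Kim"))]
visits = [(100, ("7:28", "CA")), (100, ("8:03", "IN")), (210, ("4:26", "WA"))]

# Map: tag each record with its source table so the reducer can tell
# them apart; the join key (ID) becomes the map output key.
tagged = [(k, ("A", v)) for k, v in users] + [(k, ("B", v)) for k, v in visits]

# Shuffle: all records with the same ID reach the same reducer.
groups = defaultdict(list)
for k, rec in tagged:
    groups[k].append(rec)

# Reduce: cross the A-side with the B-side per key (inner join).
joined = []
for k, recs in groups.items():
    a_side = [v for tag, v in recs if tag == "A"]
    b_side = [v for tag, v in recs if tag == "B"]
    for a in a_side:
        for b in b_side:
            joined.append((k, a, b))
```

Note that every record crosses the shuffle, which is exactly the network overhead the slide points out.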
Performance Overhead (1)
Map-Reduce suffers from Disk I/O bottlenecks
(Figure: the same Map → Combine/Partition → Shuffle → Reduce pipeline as before, with disk I/O at the input of the Map phase and again at the output of the Reduce phase — every stage boundary goes through disk.)
Performance Overhead (2)

Iterative algorithms & Map-Reduce chaining

(Figure: a chain of Map-Reduce jobs — e.g. Join, then Group-by, then Decision-Tree — with disk I/O between every pair of jobs in the chain.)
HBase Caching
HBase provides scanner caching and block caching
– Scanner caching
• setCaching(int cache);
• tells the scanner how many rows to fetch at a time
– Block caching
• setCacheBlocks(true);
HBase caching helps read/write performance, but is not sufficient to solve our problem
Spark / Shark
Spark
– In-memory computing framework
– An Apache incubator project
– RDD (Resilient Distributed Datasets)
– A fault-tolerant framework
– Targets iterative machine learning algorithms
Shark
– Data warehouse for Spark
– Compatible with Apache Hive
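The difference between chained Map-Reduce jobs and Spark's cached in-memory datasets can be caricatured in a few lines of Python (a toy simulation, not Spark code; load_data, the values, and the five-iteration loop are all invented for illustration):

```python
# Count how many times the input is read from "disk" in each style.
reads_from_disk = 0

def load_data():
    """Stand-in for reading the working set from HDFS."""
    global reads_from_disk
    reads_from_disk += 1
    return [1.0, 2.0, 3.0, 4.0]

# Map-Reduce chaining: each iteration starts by re-reading the input.
w = 0.0
for _ in range(5):
    data = load_data()
    w += sum(data) * 0.01
mr_reads = reads_from_disk

# Spark-style: load (and cache) once, then iterate over the in-memory copy.
reads_from_disk = 0
cached = load_data()
w = 0.0
for _ in range(5):
    w += sum(cached) * 0.01
spark_reads = reads_from_disk
```

This is why iterative machine learning algorithms, which touch the same data many times, are the workload Spark targets.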
Spark / Shark
Scheduling

(Figure: three deployment layouts for running Spark and Hadoop MapReduce on Linux clusters —
1. Stand-alone Spark next to Hadoop MapReduce: no fine-grained scheduling within Spark, and none between Hadoop and Spark;
2. Both on Mesos: fine-grained scheduling, but a Hadoop dependency;
3. Both on Mesos / YARN.)
Time-Series Data Layout Patterns

(Figure: a time series Ti1, Ti2, Ti3, … stored three ways — one value per record (Column), all values packed into one record (Row), or packed into an uncompressed binary blob (BLOB).)

Column: + no conversion / − slow read
Row: + fast read/write / − slow conversion
BLOB: + fast read/write / − slow search
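The trade-off can be made concrete with a small Python sketch (the sample values and series name are invented): the column layout stores one sample per record, so reading a whole series touches many records, while the row layout packs the series into one record that must be built by a conversion pass.

```python
# Column layout: one (timestamp, value) record per sample.
column_layout = [("T1", 10), ("T2", 12), ("T3", 11), ("T4", 13)]

# Reading the full series column-wise touches one record per sample.
column_records_touched = len(column_layout)

# Conversion to row layout: pack all samples into a single record
# (the "slow conversion" cost, paid once, in exchange for fast reads).
row_layout = ("series-1", [v for _, v in column_layout])

# Reading the full series row-wise touches a single record.
row_records_touched = 1
```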
Time-Series Data Layout Patterns

(Figure: a time series Ti1, Ti2, Ti3, … held in an RDB, which stores it column-wise — one sample per row — versus packed row-wise.)

The RDB is columnar here: when loading/unloading from/to an RDB, it is really important to decide whether to store in column or row format.
R and Hadoop
R is memory-based → cannot handle data that does not fit in memory
R is not thread-safe → cannot run in a multi-threaded environment
Creating a distributed version of each and every R function → cannot take advantage of the 3,500 R packages that are already built!
Running R from Hadoop

(Figure: 1M time series t1 … t1M, each 6000~7000 samples wide — what if the data are wide and fat?)

Pros: can re-use R packages with no modification
Cons: cannot handle large data that does not fit into memory
– But do we need a large number of time-series data points to predict the future?
Not so big data
"Nobody ever got fired for using Hadoop on a cluster" – HotCDP'12 paper
– Average Map-Reduce-like jobs handle less than 14 GB
Time-series analysis for data forecasting
– Sampling every minute for two years to forecast the next year → less than 2M rows
– It becomes big when sampling at sub-second resolution
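The row count in the forecasting example above checks out with simple arithmetic:

```python
# Two years of one-sample-per-minute data:
rows = 2 * 365 * 24 * 60
# About 1.05 million rows — comfortably under 2M.
```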
Statistical Analysis and Machine Learning Library

(Figure: choosing a framework by data size and job shape — R on a single server for small data; R on Hadoop for small-but-many datasets; Map-Reduce for big data with filtering jobs; Spark for chained/iterative jobs; each combined with SQL support via Hive / Shark / Impala / ….)
Summary
Map-Reduce is a surprisingly efficient framework for most filter-and-reduce operations.
For data massaging (data pre-processing), in-memory capability with SQL support is a must.
Calling R from Hadoop can be quite useful when analyzing many, but not-so-big, datasets, and it is the fastest way to grow your list of statistical and machine learning functions.