Map-Reduce and Parallel Computing for Large-Scale Media Processing (Youjie Zhou)
Post on 21-Dec-2015
Map-Reduce and Parallel Computing for Large-Scale Media Processing
Youjie Zhou
Outline
► Motivations
► Map-Reduce Framework
► Large-scale Multimedia Processing Parallelization
► Machine Learning Algorithm Transformation
► Map-Reduce Drawbacks and Variants
► Conclusions
Motivations
► Why do we need parallelization? "Time is Money"
  ► Work on pieces simultaneously
  ► Divide-and-conquer
► Data is too huge to handle
  ► 1 trillion (10^12) unique URLs in 2008
  ► CPU speed limitation
Motivations
► Why do we need parallelization? Increasing data
  ► Social networks
  ► Scalability!
► "Brute force"
  ► No approximations
  ► Cheap clusters vs. expensive computers
Motivations
► Why do we choose Map-Reduce? It is popular
  ► A parallelization framework that Google proposed and uses every day
  ► Yahoo and Amazon are also involved
► Does popular mean good?
  ► "Hides" parallelization details from users
  ► Provides high-level operations that suit the majority of algorithms
► A good starting point for deeper parallelization research
Map-Reduce Framework
► A simple idea inspired by functional languages (like LISP)
  ► map: a type of iteration in which a function is successively applied to each element of a sequence
  ► reduce: a function that combines all the elements of a sequence using a binary operation
Map-Reduce Framework
► Data representation: <key,value>
  ► map generates <key,value> pairs
  ► reduce combines <key,value> pairs that share the same key
► "Hello, world!" example
Map-Reduce Framework
[Diagram: the input data is divided into splits (split0, split1, split2); each split feeds a map task, the map outputs are shuffled to the reduce tasks, and the reduces produce the final output.]
Map-Reduce Framework
► Count the appearances of each different word in a set of documents

void map(Document):
    for each word in Document:
        generate <word, 1>

void reduce(word, CountList):
    int count = 0
    for each number in CountList:
        count += number
    generate <word, count>
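The word-count pseudocode above can be run locally as a minimal Python sketch; the `map_doc`, `reduce_word`, and `run_job` names are illustrative, and the `defaultdict` grouping simulates the framework's shuffle phase:

```python
from collections import defaultdict

def map_doc(document):
    # map: emit a <word, 1> pair for every word in the document
    return [(word, 1) for word in document.split()]

def reduce_word(word, counts):
    # reduce: sum all the counts emitted for one word
    return (word, sum(counts))

def run_job(documents):
    # shuffle: group the map outputs by key before reducing
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_doc(doc):
            grouped[word].append(count)
    return dict(reduce_word(w, c) for w, c in grouped.items())

print(run_job(["hello world", "hello map reduce"]))
# {'hello': 2, 'world': 1, 'map': 1, 'reduce': 1}
```

In a real framework the grouping is done by the system across machines; only the two functions are user code.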
Map-Reduce Framework
► Different implementations: distributed computing
  ► Each computer acts as a computing node
  ► Focuses on reliability over distributed computer networks
  ► Google's clusters: closed source; GFS, a distributed file system
  ► Hadoop: open source; HDFS, the Hadoop distributed file system
Map-Reduce Framework
► Different implementations: multi-core computing
  ► Each core acts as a computing node
  ► Focuses on high-speed computing using large shared memories
  ► Phoenix++: a two-dimensional <key,value> table stored in memory, where map and reduce read and write pairs; open source, created at Stanford
  ► GPU: 10x higher memory bandwidth than a CPU; 5x to 32x speedups on SVM training
Large-scale Multimedia Processing Parallelization
► Clustering: k-means, spectral clustering
► Classifier training: SVM
► Feature extraction and indexing: Bag-of-Features, text inverted indexing
Clustering
► k-means: basic and fundamental
  Original algorithm:
  1. Pick k initial center points
  2. Iterate until convergence:
     1. Assign each point to the nearest center
     2. Calculate the new centers
  Easy to parallelize!
Clustering
► k-means: a shared file contains the center points
  map:
  1. For each point, find the nearest center
  2. Generate a <key,value> pair (key: center ID; value: the current point's coordinates)
  reduce:
  1. Collect all points belonging to the same cluster (they share the same key)
  2. Calculate the average, which becomes the new center
  Then iterate.
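One iteration of the map-reduce k-means above can be sketched in plain Python; `kmeans_map`, `kmeans_reduce`, and `kmeans_iteration` are illustrative names, and the grouping dictionary stands in for the shuffle:

```python
from collections import defaultdict

def kmeans_map(point, centers):
    # map: assign the point to its nearest center -> <center_id, point>
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return min(range(len(centers)), key=dists.__getitem__), point

def kmeans_reduce(points):
    # reduce: average all points of one cluster -> the new center
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def kmeans_iteration(points, centers):
    grouped = defaultdict(list)          # shuffle by center ID
    for point in points:
        cid, p = kmeans_map(point, centers)
        grouped[cid].append(p)
    return {cid: kmeans_reduce(pts) for cid, pts in grouped.items()}

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = [[0.0, 0.0], [10.0, 10.0]]
print(kmeans_iteration(points, centers))
# {0: [0.0, 0.5], 1: [10.0, 10.5]}
```

The driver program would write the new centers back to the shared file and resubmit the job until they stop moving.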
Clustering
► Spectral Clustering
  S is huge: 10^6 points (stored as doubles) need 8 TB
  Sparsify it!
  ► Retain only S_ij where j is among the t nearest neighbors of i
  ► Locality-sensitive hashing? It is an approximation
  ► We can calculate it directly, in parallel
  L = I − D^(−1/2) S D^(−1/2)
Clustering
► Spectral Clustering: calculate the distance matrix
  ► map creates <key,value> pairs so that every n/p points share the same key (p is the number of nodes in the computer cluster)
  ► reduce collects the points with the same key, so the data is split into p parts and each part is stored on one node
  ► For each point in the whole data set, on each node, find the t nearest neighbors
Clustering
► Spectral Clustering: symmetry
  ► x_j being in the t-nearest-neighbor set of x_i does not imply that x_i is in the t-nearest-neighbor set of x_j
  ► map: for each nonzero element, generates two <key,value> pairs: first, the key is the row ID and the value is the column ID and distance; second, the key is the column ID and the value is the row ID and distance
  ► reduce: uses the key as the row ID and fills the columns specified by the column IDs in the values
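The symmetrization step above can be sketched as follows; `symmetry_map` and `symmetrize` are illustrative names, and the nested dictionary stands in for the sparse similarity matrix:

```python
from collections import defaultdict

def symmetry_map(row, col, dist):
    # map: emit each nonzero entry under both its row key and its column key
    return [(row, (col, dist)), (col, (row, dist))]

def symmetrize(nonzeros):
    # reduce: for each row key, fill in the columns carried in the values,
    # producing a symmetric sparse matrix
    rows = defaultdict(dict)
    for row, col, dist in nonzeros:
        for key, (other, d) in symmetry_map(row, col, dist):
            rows[key][other] = d
    return dict(rows)

# sparse t-NN graph: point 0 lists 1 as a neighbor, but 1 does not list 0
print(symmetrize([(0, 1, 0.5), (1, 2, 0.3)]))
# {0: {1: 0.5}, 1: {0: 0.5, 2: 0.3}, 2: {1: 0.3}}
```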
Classification
► SVM
  Decision function: y = w^T x + b
  Primal: w = argmin_w ||w||^2 / 2 (subject to the margin constraints)
  Dual: α = argmax_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
Classification
► SVM → SMO: instead of solving for all alphas together, use coordinate ascent
  ► Pick one alpha and fix the others
  ► Optimize alpha_i
Classification
► SVM → SMO
  But we cannot optimize only one alpha for the SVM; we need to optimize two alphas in each iteration:
  α = argmax_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
  subject to 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0
  (the equality constraint forces α_1 y_1 + α_2 y_2 to stay constant, so both must move together)
Classification
► SVM: repeat until convergence
  ► map: given the two chosen alphas, update the optimization information
  ► reduce: find the two maximally violating alphas
Feature Extraction and Indexing
► Bag-of-Features: features → feature clusters → histogram
  Feature extraction:
  ► map takes images in and outputs features directly
  Feature clustering:
  ► clustering algorithms, such as k-means
Feature Extraction and Indexing
► Bag-of-Features: feature quantization into a histogram
  ► map: for each feature of one image, find the nearest feature cluster; generates <imageID, clusterID>
  ► reduce: given <imageID, cluster0, cluster1, …>, update the histogram for each feature cluster; generates <imageID, histogram>
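The quantization step above can be sketched as follows; `bof_map` and `bof_reduce` are illustrative names, and features are one-dimensional here only to keep the example small:

```python
from collections import defaultdict

def bof_map(image_id, features, cluster_centers):
    # map: quantize each feature to its nearest cluster -> <imageID, clusterID>
    pairs = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c))
                 for c in cluster_centers]
        pairs.append((image_id,
                      min(range(len(cluster_centers)), key=dists.__getitem__)))
    return pairs

def bof_reduce(image_id, cluster_ids, k):
    # reduce: count cluster hits -> the bag-of-features histogram
    hist = [0] * k
    for cid in cluster_ids:
        hist[cid] += 1
    return image_id, hist

centers = [[0.0], [10.0]]
grouped = defaultdict(list)           # shuffle by image ID
for img, cid in bof_map("img0", [[0.1], [9.8], [10.2]], centers):
    grouped[img].append(cid)
print([bof_reduce(img, cids, len(centers)) for img, cids in grouped.items()])
# [('img0', [1, 2])]
```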
Feature Extraction and Indexing
► Text inverted indexing: the inverted index of a term
  ► a document list containing the term
  ► each item in the document list stores statistical information (frequency, position, field information)
  map:
  ► for each term in one document, generates <term, docID>
  reduce:
  ► given <term, doc0, doc1, doc2, …>, for each document, update the statistical information for that term
  ► generates <term, list>
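The indexing job above can be sketched as follows; `index_map` and `index_reduce` are illustrative names, and the postings carry term positions so frequency and position statistics survive the shuffle:

```python
from collections import defaultdict

def index_map(doc_id, text):
    # map: emit <term, (docID, position)> for every term occurrence
    return [(term, (doc_id, pos)) for pos, term in enumerate(text.split())]

def index_reduce(term, postings):
    # reduce: build the document list with per-document statistics
    stats = defaultdict(lambda: {"frequency": 0, "positions": []})
    for doc_id, pos in postings:
        stats[doc_id]["frequency"] += 1
        stats[doc_id]["positions"].append(pos)
    return term, dict(stats)

docs = {"d0": "map reduce map", "d1": "reduce"}
grouped = defaultdict(list)           # shuffle by term
for doc_id, text in docs.items():
    for term, posting in index_map(doc_id, text):
        grouped[term].append(posting)
index = dict(index_reduce(t, p) for t, p in grouped.items())
print(index["map"])
# {'d0': {'frequency': 2, 'positions': [0, 2]}}
```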
Machine Learning Algorithm Transformation
► How can we know whether an algorithm can be transformed into a Map-Reduce fashion? And if so, how do we do it?
► Statistical Query and Summation Form
  All we want is to estimate or infer
  ► cluster IDs, labels, …
  from sufficient statistics
  ► distances between points
  ► point positions
  The statistics computation can be divided.
Machine Learning Algorithm Transformation
► Linear Regression
  b = argmin_b (Y − Xb)^T (Y − Xb)
  b = (X^T X)^{−1} X^T Y
  Summation form: X^T X = Σ_i x_i x_i^T and X^T Y = Σ_i x_i y_i
  [Diagram: each map computes the partial sums over its data split; the reduces combine them.]
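The summation form above can be sketched for the one-dimensional case, where X^T X and X^T Y reduce to scalar sums; `linreg_map` and `linreg_reduce` are illustrative names:

```python
def linreg_map(split):
    # map: partial sums over one data split:
    # A_part = sum of x_i * x_i, b_part = sum of x_i * y_i
    A = sum(x * x for x, _ in split)
    b = sum(x * y for x, y in split)
    return A, b

def linreg_reduce(partials):
    # reduce: combine the partial sums, then solve (X^T X) b = X^T Y
    A = sum(a for a, _ in partials)
    b = sum(c for _, c in partials)
    return b / A

# two splits of (x, y) pairs lying on y = 2x
splits = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
print(linreg_reduce([linreg_map(s) for s in splits]))
# 2.0
```

In the multivariate case the partial sums become a d×d matrix and a d-vector, and the reduce ends with one small linear solve, so only the sums need to be distributed.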
Machine Learning Algorithm Transformation
► Naïve Bayes
  v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, …, a_n)
        = argmax_{v_j ∈ V} P(a_1, …, a_n | v_j) P(v_j) / P(a_1, …, a_n)
        = argmax_{v_j ∈ V} P(a_1, …, a_n | v_j) P(v_j)
  v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j)
  [Diagram: map counts the label and feature occurrences on each split; reduce combines the counts.]
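The counting step behind the Naïve Bayes estimates can be sketched as follows; `nb_map` and `nb_reduce` are illustrative names, and the counts are the sufficient statistics from which P(v_j) and P(a_i | v_j) are estimated:

```python
from collections import Counter

def nb_map(split):
    # map: count label occurrences and (feature, label) co-occurrences
    # over one data split
    label_counts, pair_counts = Counter(), Counter()
    for features, label in split:
        label_counts[label] += 1
        for a in features:
            pair_counts[(a, label)] += 1
    return label_counts, pair_counts

def nb_reduce(partials):
    # reduce: combine the per-split counts into global sufficient statistics
    labels, pairs = Counter(), Counter()
    for lc, pc in partials:
        labels += lc
        pairs += pc
    return labels, pairs

splits = [[(["sunny"], "yes")],
          [(["sunny"], "no"), (["rainy"], "no")]]
labels, pairs = nb_reduce([nb_map(s) for s in splits])
print(labels["no"], pairs[("sunny", "yes")])
# 2 1
```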
Machine Learning Algorithm Transformation
► Solution
  Find the statistics-calculation part
  Distribute the calculations over the data using map
  Gather and refine all the statistics in reduce
Map-Reduce Systems Drawbacks
► Batch-based system: a "pull" model
  ► reduce must wait for unfinished maps
  ► reduce "pulls" data from map
  ► no direct support for iteration
► Focuses too much on distributed systems and failure tolerance
  ► a local computing cluster may not need them
Map-Reduce Variants
► Map-Reduce Online: a "push" model
  ► map "pushes" data to reduce
  ► reduce can also "push" results to the maps of the next job, building a pipeline
► Iterative Map-Reduce: higher-level schedulers that schedule the whole iteration process
Map-Reduce Variants
► Series Map-Reduce?
  [Diagram: a chain of multi-core Map-Reduce stages; the series itself could be coordinated by Map-Reduce? MPI? Condor?]
Conclusions
► A good parallelization framework
  Schedules jobs automatically
  Failure tolerance
  Distributed computing supported
  High-level abstraction
  ► easy to port algorithms onto it
► Too "industry"
  Why do we need a large distributed system?
  Why do we need so much data safety?
Thanks
Q & A