Map-Reduce and Parallel Computing for Large-Scale Media Processing (Youjie Zhou)
Post on 21-Dec-2015
Map-Reduce and Parallel Computing for Large-Scale Media Processing
Youjie Zhou
Outline
► Motivations
► Map-Reduce Framework
► Large-scale Multimedia Processing Parallelization
► Machine Learning Algorithm Transformation
► Map-Reduce Drawbacks and Variants
► Conclusions
Motivations
► Why do we need parallelization? "Time is Money"
  ► Work on pieces simultaneously
  ► Divide-and-conquer
► Data is too huge to handle
  ► 1 trillion (10^12) unique URLs in 2008
  ► CPU speed limitation
Motivations
► Why do we need parallelization? Increasing data
  ► Social networks
  ► Scalability!
► "Brute force"
  ► No approximations
  ► Cheap clusters vs. expensive computers
Motivations
► Why do we choose Map-Reduce? It is popular
  ► A parallelization framework that Google proposed and uses every day
  ► Yahoo and Amazon are also involved
► Does popular mean good?
  ► "Hides" parallelization details from users
  ► Provides high-level operations that suit the majority of algorithms
► A good starting point for deeper parallelization research
Map-Reduce Framework
► A simple idea inspired by functional languages (like LISP)
  ► map: a type of iteration in which a function is successively applied to each element of a sequence
  ► reduce: a function that combines all the elements of a sequence using a binary operation
Map-Reduce Framework
► Data representation: <key,value>
  ► map generates <key,value> pairs
  ► reduce combines <key,value> pairs that share the same key
► "Hello, world!" example
Map-Reduce Framework
[Diagram: the input data is divided into splits (split0, split1, split2); each split feeds a map task, the map outputs are shuffled to the reduce tasks, and the reduces produce the final output.]
Map-Reduce Framework
► Count the appearances of each different word in a set of documents

void map(Document):
    for each word in Document:
        generate <word, 1>

void reduce(word, CountList):
    int count = 0
    for each number in CountList:
        count += number
    generate <word, count>
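The word-count pseudocode above can be run locally as a minimal Python sketch; the `map_doc`, `reduce_word`, and `run_job` names are illustrative, and the `defaultdict` grouping simulates the framework's shuffle phase:

```python
from collections import defaultdict

def map_doc(document):
    # map: emit a <word, 1> pair for every word in the document
    return [(word, 1) for word in document.split()]

def reduce_word(word, counts):
    # reduce: sum all the counts emitted for one word
    return (word, sum(counts))

def run_job(documents):
    # shuffle: group the map outputs by key before reducing
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_doc(doc):
            grouped[word].append(count)
    return dict(reduce_word(w, c) for w, c in grouped.items())

print(run_job(["hello world", "hello map reduce"]))
# {'hello': 2, 'world': 1, 'map': 1, 'reduce': 1}
```

In a real framework the grouping is done by the system across machines; only the two functions are user code.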
Map-Reduce Framework
► Different implementations: distributed computing
  ► Each computer acts as a computing node
  ► Focuses on reliability over distributed computer networks
  ► Google's clusters: closed source; GFS, a distributed file system
  ► Hadoop: open source; HDFS, the Hadoop distributed file system
Map-Reduce Framework
► Different implementations: multi-core computing
  ► Each core acts as a computing node
  ► Focuses on high-speed computing using large shared memories
  ► Phoenix++: a two-dimensional <key,value> table stored in memory, where map and reduce read and write pairs; open source, created at Stanford
  ► GPU: 10x higher memory bandwidth than a CPU; 5x to 32x speedups on SVM training
Large-scale Multimedia Processing Parallelization
► Clustering: k-means, spectral clustering
► Classifier training: SVM
► Feature extraction and indexing: Bag-of-Features, text inverted indexing
Clustering
► k-means: basic and fundamental
  Original algorithm:
  1. Pick k initial center points
  2. Iterate until convergence:
     1. Assign each point to the nearest center
     2. Calculate the new centers
  Easy to parallelize!
Clustering
► k-means: a shared file contains the center points
  map:
  1. For each point, find the nearest center
  2. Generate a <key,value> pair (key: center ID; value: the current point's coordinates)
  reduce:
  1. Collect all points belonging to the same cluster (they share the same key)
  2. Calculate the average, which becomes the new center
  Then iterate.
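One iteration of the map-reduce k-means above can be sketched in plain Python; `kmeans_map`, `kmeans_reduce`, and `kmeans_iteration` are illustrative names, and the grouping dictionary stands in for the shuffle:

```python
from collections import defaultdict

def kmeans_map(point, centers):
    # map: assign the point to its nearest center -> <center_id, point>
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return min(range(len(centers)), key=dists.__getitem__), point

def kmeans_reduce(points):
    # reduce: average all points of one cluster -> the new center
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def kmeans_iteration(points, centers):
    grouped = defaultdict(list)          # shuffle by center ID
    for point in points:
        cid, p = kmeans_map(point, centers)
        grouped[cid].append(p)
    return {cid: kmeans_reduce(pts) for cid, pts in grouped.items()}

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = [[0.0, 0.0], [10.0, 10.0]]
print(kmeans_iteration(points, centers))
# {0: [0.0, 0.5], 1: [10.0, 10.5]}
```

The driver program would write the new centers back to the shared file and resubmit the job until they stop moving.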
Clustering
► Spectral Clustering
  S is huge: 10^6 points (stored as doubles) need 8 TB
  Sparsify it!
  ► Retain only S_ij where j is among the t nearest neighbors of i
  ► Locality-sensitive hashing? It is an approximation
  ► We can calculate it directly, in parallel
  L = I − D^(−1/2) S D^(−1/2)
Clustering
► Spectral Clustering: calculate the distance matrix
  ► map creates <key,value> pairs so that every n/p points share the same key (p is the number of nodes in the computer cluster)
  ► reduce collects the points with the same key, so the data is split into p parts and each part is stored on one node
  ► For each point in the whole data set, on each node, find the t nearest neighbors
Clustering
► Spectral Clustering: symmetry
  ► x_j being in the t-nearest-neighbor set of x_i does not imply that x_i is in the t-nearest-neighbor set of x_j
  ► map: for each nonzero element, generates two <key,value> pairs: first, the key is the row ID and the value is the column ID and distance; second, the key is the column ID and the value is the row ID and distance
  ► reduce: uses the key as the row ID and fills the columns specified by the column IDs in the values
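The symmetrization step above can be sketched as follows; `symmetry_map` and `symmetrize` are illustrative names, and the nested dictionary stands in for the sparse similarity matrix:

```python
from collections import defaultdict

def symmetry_map(row, col, dist):
    # map: emit each nonzero entry under both its row key and its column key
    return [(row, (col, dist)), (col, (row, dist))]

def symmetrize(nonzeros):
    # reduce: for each row key, fill in the columns carried in the values,
    # producing a symmetric sparse matrix
    rows = defaultdict(dict)
    for row, col, dist in nonzeros:
        for key, (other, d) in symmetry_map(row, col, dist):
            rows[key][other] = d
    return dict(rows)

# sparse t-NN graph: point 0 lists 1 as a neighbor, but 1 does not list 0
print(symmetrize([(0, 1, 0.5), (1, 2, 0.3)]))
# {0: {1: 0.5}, 1: {0: 0.5, 2: 0.3}, 2: {1: 0.3}}
```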
Classification
► SVM
  Decision function: y = w^T x + b
  Primal: w = argmin_w ||w||^2 / 2 (subject to the margin constraints)
  Dual: α = argmax_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
Classification
► SVM → SMO: instead of solving for all alphas together, use coordinate ascent
  ► Pick one alpha and fix the others
  ► Optimize alpha_i
Classification
► SVM → SMO
  But we cannot optimize only one alpha for the SVM; we need to optimize two alphas in each iteration:
  α = argmax_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
  subject to 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0
  (the equality constraint forces α_1 y_1 + α_2 y_2 to stay constant, so both must move together)
Classification
► SVM: repeat until convergence
  ► map: given the two chosen alphas, update the optimization information
  ► reduce: find the two maximally violating alphas
Feature Extraction and Indexing
► Bag-of-Features: features → feature clusters → histogram
  Feature extraction:
  ► map takes images in and outputs features directly
  Feature clustering:
  ► clustering algorithms, such as k-means
Feature Extraction and Indexing
► Bag-of-Features: feature quantization into a histogram
  ► map: for each feature of one image, find the nearest feature cluster; generates <imageID, clusterID>
  ► reduce: given <imageID, cluster0, cluster1, …>, update the histogram for each feature cluster; generates <imageID, histogram>
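The quantization step above can be sketched as follows; `bof_map` and `bof_reduce` are illustrative names, and features are one-dimensional here only to keep the example small:

```python
from collections import defaultdict

def bof_map(image_id, features, cluster_centers):
    # map: quantize each feature to its nearest cluster -> <imageID, clusterID>
    pairs = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c))
                 for c in cluster_centers]
        pairs.append((image_id,
                      min(range(len(cluster_centers)), key=dists.__getitem__)))
    return pairs

def bof_reduce(image_id, cluster_ids, k):
    # reduce: count cluster hits -> the bag-of-features histogram
    hist = [0] * k
    for cid in cluster_ids:
        hist[cid] += 1
    return image_id, hist

centers = [[0.0], [10.0]]
grouped = defaultdict(list)           # shuffle by image ID
for img, cid in bof_map("img0", [[0.1], [9.8], [10.2]], centers):
    grouped[img].append(cid)
print([bof_reduce(img, cids, len(centers)) for img, cids in grouped.items()])
# [('img0', [1, 2])]
```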
Feature Extraction and Indexing
► Text inverted indexing: the inverted index of a term
  ► a document list containing the term
  ► each item in the document list stores statistical information (frequency, position, field information)
  map:
  ► for each term in one document, generates <term, docID>
  reduce:
  ► given <term, doc0, doc1, doc2, …>, for each document, update the statistical information for that term
  ► generates <term, list>
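The indexing job above can be sketched as follows; `index_map` and `index_reduce` are illustrative names, and the postings carry term positions so frequency and position statistics survive the shuffle:

```python
from collections import defaultdict

def index_map(doc_id, text):
    # map: emit <term, (docID, position)> for every term occurrence
    return [(term, (doc_id, pos)) for pos, term in enumerate(text.split())]

def index_reduce(term, postings):
    # reduce: build the document list with per-document statistics
    stats = defaultdict(lambda: {"frequency": 0, "positions": []})
    for doc_id, pos in postings:
        stats[doc_id]["frequency"] += 1
        stats[doc_id]["positions"].append(pos)
    return term, dict(stats)

docs = {"d0": "map reduce map", "d1": "reduce"}
grouped = defaultdict(list)           # shuffle by term
for doc_id, text in docs.items():
    for term, posting in index_map(doc_id, text):
        grouped[term].append(posting)
index = dict(index_reduce(t, p) for t, p in grouped.items())
print(index["map"])
# {'d0': {'frequency': 2, 'positions': [0, 2]}}
```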
Machine Learning Algorithm Transformation
► How can we know whether an algorithm can be transformed into a Map-Reduce fashion? And if so, how do we do it?
► Statistical Query and Summation Form
  All we want is to estimate or infer
  ► cluster IDs, labels, …
  from sufficient statistics
  ► distances between points
  ► point positions
  The statistics computation can be divided.
Machine Learning Algorithm Transformation
► Linear Regression
  b = argmin_b (Y − Xb)^T (Y − Xb)
  b = (X^T X)^{−1} X^T Y
  Summation form: X^T X = Σ_i x_i x_i^T and X^T Y = Σ_i x_i y_i
  [Diagram: each map computes the partial sums over its data split; the reduces combine them.]
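The summation form above can be sketched for the one-dimensional case, where X^T X and X^T Y reduce to scalar sums; `linreg_map` and `linreg_reduce` are illustrative names:

```python
def linreg_map(split):
    # map: partial sums over one data split:
    # A_part = sum of x_i * x_i, b_part = sum of x_i * y_i
    A = sum(x * x for x, _ in split)
    b = sum(x * y for x, y in split)
    return A, b

def linreg_reduce(partials):
    # reduce: combine the partial sums, then solve (X^T X) b = X^T Y
    A = sum(a for a, _ in partials)
    b = sum(c for _, c in partials)
    return b / A

# two splits of (x, y) pairs lying on y = 2x
splits = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
print(linreg_reduce([linreg_map(s) for s in splits]))
# 2.0
```

In the multivariate case the partial sums become a d×d matrix and a d-vector, and the reduce ends with one small linear solve, so only the sums need to be distributed.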
Machine Learning Algorithm Transformation
► Naïve Bayes
  v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, …, a_n)
        = argmax_{v_j ∈ V} P(a_1, …, a_n | v_j) P(v_j) / P(a_1, …, a_n)
        = argmax_{v_j ∈ V} P(a_1, …, a_n | v_j) P(v_j)
  v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j)
  [Diagram: map counts the label and feature occurrences on each split; reduce combines the counts.]
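The counting step behind the Naïve Bayes estimates can be sketched as follows; `nb_map` and `nb_reduce` are illustrative names, and the counts are the sufficient statistics from which P(v_j) and P(a_i | v_j) are estimated:

```python
from collections import Counter

def nb_map(split):
    # map: count label occurrences and (feature, label) co-occurrences
    # over one data split
    label_counts, pair_counts = Counter(), Counter()
    for features, label in split:
        label_counts[label] += 1
        for a in features:
            pair_counts[(a, label)] += 1
    return label_counts, pair_counts

def nb_reduce(partials):
    # reduce: combine the per-split counts into global sufficient statistics
    labels, pairs = Counter(), Counter()
    for lc, pc in partials:
        labels += lc
        pairs += pc
    return labels, pairs

splits = [[(["sunny"], "yes")],
          [(["sunny"], "no"), (["rainy"], "no")]]
labels, pairs = nb_reduce([nb_map(s) for s in splits])
print(labels["no"], pairs[("sunny", "yes")])
# 2 1
```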
Machine Learning Algorithm Transformation
► Solution
  Find the statistics-calculation part
  Distribute the calculations over the data using map
  Gather and refine all the statistics in reduce
Map-Reduce Systems Drawbacks
► Batch-based system: a "pull" model
  ► reduce must wait for unfinished maps
  ► reduce "pulls" data from map
  ► no direct support for iteration
► Focuses too much on distributed systems and failure tolerance
  ► a local computing cluster may not need them
Map-Reduce Variants
► Map-Reduce Online: a "push" model
  ► map "pushes" data to reduce
  ► reduce can also "push" results to the maps of the next job, building a pipeline
► Iterative Map-Reduce: higher-level schedulers that schedule the whole iteration process
Map-Reduce Variants
► Series Map-Reduce?
  [Diagram: a chain of multi-core Map-Reduce stages; the series itself could be coordinated by Map-Reduce? MPI? Condor?]
Conclusions
► A good parallelization framework
  Schedules jobs automatically
  Failure tolerance
  Distributed computing supported
  High-level abstraction
  ► easy to port algorithms onto it
► Too "industry"
  Why do we need a large distributed system?
  Why do we need so much data safety?
Thanks
Q & A