Future Generation Computer Systems 36 (2014) 19–30

Master–worker model for MapReduce paradigm on the TILE64 many-core platform

Xuan-Yi Lin, Yeh-Ching Chung∗

Department of Computer Science, National Tsing Hua University, 101 Section 2, Kuang Fu Road, Hsinchu City 30013, Taiwan

∗ Corresponding author. Tel.: +886 35742971. E-mail addresses: [email protected] (X.-Y. Lin), [email protected] (Y.-C. Chung).

Highlights

• We model two shared memory master–worker programming schemes for TILE64.
• We apply the proposed schemes to the MapReduce paradigm.
• Analysis shows that the worker share scheme is superior to the master share scheme.

Article info

Article history:
Received 3 January 2013
Received in revised form 1 May 2013
Accepted 7 May 2013
Available online 16 May 2013

Keywords:
Many-core
Master–worker
MapReduce
Shared memory
TILE64

Abstract

MapReduce is a popular programming paradigm for processing big data. It uses the master–worker model, which is widely used on distributed and loosely coupled systems such as clusters, to solve large problems with task parallelism. With the ubiquity of many-core architectures in recent years and the foreseeable future, the many-core platform will be one of the main computing platforms to execute MapReduce programs. Therefore, it is essential to optimize MapReduce programs on many-core platforms. Optimization of parallel programs for a many-core platform is a multifaceted problem, where both system and architectural factors should be taken into account. In this paper, we look into the problem by constructing a master–worker model for the MapReduce paradigm on the TILE64 many-core platform. We investigate the master share and worker share schemes for the implementation of a MapReduce library on the TILE64. The theoretical analysis shows that the worker share scheme is inherently better for the implementation of a MapReduce library on the TILE64 many-core platform.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, the industry has undergone a transition from single-core processors to the integration of multiple cores to produce multi-core and many-core processors, due to power envelope restrictions [1]. While the trend in processor manufacturing is to increase the number of cores rather than the clock frequency [2,3], software developers can no longer rely on the so-called ‘‘free lunch’’ [4] that automatically makes existing programs run faster on processors clocked at higher frequencies.

In order to make the performance of a program scale well with the number of available cores on a multi-core or many-core platform, existing software needs to be modified or re-written from the ground up [5,6]. Efforts involved in parallelizing an application are twofold, known as design and implementation. The former is about finding concurrency in a given application and deriving algorithms and program structures to make it run faster, while the latter is about utilizing the available programming resources on the designated parallel platform to realize the designed algorithm and structure. The available programming resources include the programming language, programming paradigm, and API (application programming interface), among others. Due to the flexibility of available options, there may be several possible implementations for a single design on a platform. The performance and scalability characteristics of the completed application may vary with different implementations. Thus, it is important to set guidelines for developers to follow in order to produce better programs on a given platform.

TILE64 is a family of general purpose many-core processors designed and manufactured by Tilera [7]. Fig. 1 shows the architecture overview of a TILE64 processor. A TILE64 processor contains a two-dimensional array of 64 identical processor cores interconnected via multiple on-chip mesh networks named iMesh. The iMesh is designed to scale to a large number of cores while maintaining low-latency communication between tiles. Tilera provides a set of proprietary APIs called iLib for programmers to write application programs. The iLib provides both shared memory and message passing primitives for the implementation of inter-process communication. The availability of different and varied implementation options adds both flexibility and complexity in building parallel programs on this platform.


Fig. 1. TILE64 processor architecture overview.

Fig. 2. A master–worker model.

The master–worker model has been successfully used in many research areas. It is often adopted when there is a need to dynamically balance workloads among available processors [8,9], especially in large distributed computing environments such as clusters [10], grids [11], clouds [12] and even petascale resources [13]. In addition to applications in distributed computing environments, with the recent availability of multi-core and many-core processors, the master–worker model can also be adopted in smaller-scale systems [14]. Fig. 2 shows a generic master–worker model, which consists of two main parts: task distribution and result collection. In the task distribution part, the master generates a set of tasks and distributes them to the workers; the master can be seen as a producer and the workers can be seen as consumers. In contrast, in the result collection part, the master collects computation results generated by the workers; thus, the workers can be seen as producers and the master can be seen as a consumer.

The MapReduce [15] paradigm has been successfully practiced on cluster systems for large-scale distributed problems and the processing of big data. It utilizes the master–worker model to schedule and dispatch computational tasks over a large set of distributed computers. In addition to the proprietary in-house implementation by Google Inc., there are also open source MapReduce implementations such as Hadoop [16], which is written in Java, and Phoenix [17,18], which is written in C. The Hadoop implementation is primarily deployed in distributed and loosely coupled environments. The Phoenix implementation is developed mainly for shared-memory architectures such as multi-core and SMP systems. The MapReduce paradigm can be adopted in many different application domains such as scientific computing, artificial intelligence, enterprise computing and image processing.

Although a large number of papers discuss applications of the master–worker model on various systems and platforms, only a few are related to applications of the master–worker model on many-core platforms. In addition, although there are MapReduce implementations that target multi-core shared-memory systems, the scalability of these implementations on a many-core platform with on-chip interconnection networks, such as TILE64, has not yet been fully investigated. With the ubiquity of many-core architectures in recent years and the foreseeable future, the many-core platform will be one of the main computing platforms to execute MapReduce programs. It is important to explore the problem of mapping traditional models onto many-core platforms.

In this paper, we study how to develop a scalable, high-performance MapReduce library similar to that of Phoenix on the TILE64 many-core platform. The management of the communications between master and worker processes is the key to the success of such a development. We propose two shared memory schemes, master share and worker share, to implement the shared memory communication between master and worker processes. We model and compare these two schemes and conclude that the worker share scheme is superior to the master share scheme on the TILE64 many-core platform.

The rest of this paper is organized as follows. Section 2 provides background knowledge of TILE64 and the approach for carrying out shared memory communication between two processes on the TILE64. In Section 3, a master–worker MapReduce system is described and the master share and worker share schemes are introduced. Theoretical analysis is carried out in Section 4. Concluding remarks and future work are given in Section 5.

2. Preliminaries

2.1. TILE64 processor

The TILE64 processor is a many-core processor featuring an array of 64 identical processor cores (each referred to as a tile) interconnected via the on-chip two-dimensional mesh network. The TILE64 is fully programmable using standard ANSI C under a Linux environment, including a set of proprietary APIs called iLib. The iLib library supports two communication mechanisms, shared memory and distributed memory, for processes running on different cores to communicate with each other.

The TILE64 platform has an on-chip network named iMesh that interconnects all 64 processor cores. All inter-process communications in a multi-process program are translated into underlying network traffic, which is fully transparent to programmers. As a process executes load/store instructions, it does not necessarily have knowledge of the overhead imposed on the underlying network. Thus, when multiple processes concurrently access memory devices, the generated network traffic can sometimes overwhelm the network, causing traffic congestion and routing delays, which directly affect program performance. Inter-process communication should generate as little network traffic as possible so that the overall network performance on this many-core platform is not degraded.

A previous study [19] suggests that programmers can implement applications in a way where producer processes always write data directly into memory addresses shared by consumer processes, to avoid unnecessary cache coherence traffic on the memory network. In the literature, there are some discussions of scalability issues on many-core processors featuring on-chip networks or multiple memory controllers [20,21]. In our previous work [22], we have shown that it is necessary to consider the memory hierarchy and on-chip networks in order to develop high performance applications on the TILE64 platform.

2.2. Shared memory communication on TILE64

In TILE64, shared memory communication allows each process in a parallel application to load/store values from/to a globally visible region of memory. Concurrent accesses to shared objects must be synchronized with mutex (mutual exclusion) locks to prevent inconsistent states.

Both the Linux and iLib programming environments provide tools for allocating and synchronizing accesses to the shared memory. Linux allows programs to allocate and synchronize using the standard Unix shared memory and pthreads APIs, while iLib supports a special function for shared memory allocation, malloc_shared(), as well as an implementation of a pthreads-style mutex lock. To use iLib to implement shared memory mechanisms in a program, the process that shares information can call the malloc_shared() function to get an address pointing to a block of shared memory. The process then notifies other processes of the location of the shared memory by sending them messages containing this address.

Fig. 3. Sharing of an integer on TILE64 between two processes.

Fig. 3 shows an example of using iLib to create an integer object shared between two processes. The initialization steps are as follows:

– There are two cores, each executing one process;
– Process 0 allocates a region of memory to hold one integer using malloc_shared();
– The malloc_shared() function returns a value x, which is the address of the shared integer. The value of x is stored in an integer pointer p in process 0;
– Process 0 sends the content of p to process 1;
– Process 1 stores this address in an integer pointer q.

After the above initialization sequence, both processes 0 and 1 are able to load from and store to this shared integer in the same way as normal variables. Any update to *q made by process 1 can be seen by process 0 through *p, and vice versa.
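The following C sketch walks through this sequence. malloc_shared() is the iLib allocator described above; the rank query and the ilib_msg_send()/ilib_msg_receive() calls stand in for the iLib messaging primitives used to exchange the pointer, and their exact signatures here should be read as illustrative assumptions rather than the verbatim API.

#include <ilib.h>                    /* Tilera iLib header (assumed name) */

#define ADDR_TAG 17                  /* arbitrary message tag for the pointer */

void share_integer_example(void)
{
    int rank = ilib_group_rank(ILIB_GROUP_SIBLINGS);  /* assumed rank query */

    if (rank == 0) {
        /* Process 0 allocates a region of memory holding one integer;
         * malloc_shared() returns its address, kept in pointer p. */
        int *p = (int *) malloc_shared(sizeof(int));
        *p = 0;
        /* Process 0 sends the content of p (the address) to process 1. */
        ilib_msg_send(ILIB_GROUP_SIBLINGS, 1, ADDR_TAG, &p, sizeof(p));
        /* From here on, process 0 reads and writes *p normally. */
    } else if (rank == 1) {
        int *q;
        /* Process 1 stores the received address in pointer q. */
        ilib_msg_receive(ILIB_GROUP_SIBLINGS, 0, ADDR_TAG, &q, sizeof(q), NULL);
        *q = *q + 1;                 /* visible to process 0 through *p */
    }
}

Concurrent updates to the shared integer would additionally require the mutex lock mentioned above.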

3. Master–worker model for MapReduce paradigm

Given an input dataset to be processed by a MapReduce program, we assume that the input dataset can be divided into n tasks that can be independently processed and output. The input dataset can be represented as a set of input data $f_{i1}$ to $f_{in}$, and the output dataset is represented as $f_{o1}$ to $f_{on}$. Assuming the application runs on a single processor and each task i takes time $t_i$ to be processed from input format to output format, the time to process all segments is:

$$\sum_{i=1}^{n} t_i. \quad (1)$$

Fig. 4. Perfect task scheduling on 4 processors.

Fig. 5. Execution overview of master–worker MapReduce library.

The ideal case of processing such a dataset using p processors would be similar to the one shown in Fig. 4. In this ideal case, $t_1 = t_2 = \cdots = t_n$ and n is an exact multiple of p, so the time needed to process all segments becomes:

$$\frac{\sum_{i=1}^{n} t_i}{p}. \quad (2)$$

This leads to a perfect speedup of p. In reality, different data segments may take variable amounts of time to process, and n is commonly not an exact multiple of p.

A master–worker system consists of a master process managing a set of worker processes. The master process distributes tasks to a set of subordinate worker processes and later collects computed results. There are two task pools in a master–worker system, the pool of pending tasks and the pool of completed tasks. Once a worker finishes a task, the worker process fills the result into the pool of completed tasks. The master process then fetches results from the pool of completed tasks and outputs the results.
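As a concrete illustration of the two pools, a minimal sketch using fixed-capacity ring buffers guarded by the pthreads-style mutex mentioned in Section 2.2 might look as follows; all type and field names here are hypothetical, not taken from the paper's library.

#include <pthread.h>
#include <stddef.h>

#define POOL_SIZE 64                    /* illustrative pool capacity */

typedef struct {
    void  *data;                        /* task input or result payload */
    size_t len;
} task_t;

typedef struct {
    task_t *slots[POOL_SIZE];           /* ring buffer of task pointers */
    int head, tail, count;
    pthread_mutex_t lock;               /* serializes master/worker access */
} task_pool_t;

task_pool_t pending;                    /* master fills, workers drain */
task_pool_t completed;                  /* workers fill, master drains */

/* Returns 0 when the pool is full; the caller retries or idles. */
int pool_push(task_pool_t *p, task_t *t)
{
    pthread_mutex_lock(&p->lock);
    int ok = (p->count < POOL_SIZE);
    if (ok) {
        p->slots[p->tail] = t;
        p->tail = (p->tail + 1) % POOL_SIZE;
        p->count++;
    }
    pthread_mutex_unlock(&p->lock);
    return ok;
}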

Fig. 5 illustrates the execution overview of the master–worker MapReduce library. At the beginning, a user program sets up essential information and invokes mapreduce(). In the map phase, all workers take split parts of the input data and process them with the user-defined map() function to generate key-value pairs stored as intermediate data. Then, in the reduce phase, all workers compute final results by running the user-defined reduce() function over the intermediate data.
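A Phoenix-style user program built on such a library might be structured as below. The argument block and entry-point signature are assumptions for illustration; the paper does not spell out its API.

#include <stddef.h>

/* Hypothetical callback types: the library invokes these per split/key. */
typedef void (*map_fn)(void *split, size_t len);
typedef void (*reduce_fn)(const char *key, void **vals, size_t nvals);

typedef struct {
    void      *input;         /* whole input buffer */
    size_t     input_len;
    map_fn     map;           /* user-defined map() */
    reduce_fn  reduce;        /* user-defined reduce() */
    int        num_workers;   /* worker processes to use */
} mr_args_t;

/* Library entry point: splits the input, runs the map phase over all
 * workers, then the reduce phase over the intermediate key-value pairs.
 * A user program fills an mr_args_t and calls mapreduce(&args). */
int mapreduce(mr_args_t *args);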

During task distribution, the master process is considered a producer process and the worker processes are considered consumer processes, giving rise to one-to-many communication. Conversely, during result collection, the worker processes are considered producer processes and the master process is considered a consumer process, giving rise to many-to-one communication.

The total time to process all tasks can be derived as:

$$t_{total} = t_{read} + t_{fill} + t_{drain} + t_{write} + t_{comp} + t_{sync} + t_{idle}. \quad (3)$$

Since the time spent by workers essentially overlaps with the time spent by the master, the total time only counts time spent by the master. The components of (3) are described as follows:

– $t_{read}$: time the master spends reading input data from input to memory;
– $t_{fill}$: time the master spends storing all pending tasks into the pool of pending tasks;
– $t_{drain}$: time the master spends loading all completed tasks from the pool of completed tasks;
– $t_{write}$: time the master spends writing output data from memory to output;
– $t_{comp}$: time the master spends on computation such as decomposing input data and composing output data;
– $t_{sync}$: time the master spends waiting for mutex locks to gain access to shared objects;
– $t_{idle}$: time the master spends idling.

Of the above 7 components, $t_{read}$, $t_{write}$ and $t_{comp}$ can be seen as constants for a given input dataset; that is, these three timing values are not affected by system configuration variables such as the number of workers, the size of the task pools, and how inter-process communications are carried out.

To examine the performance characteristics in more detail, we further derive:

$$t_{fill} = \frac{S_{input}}{\omega_{master \to pending}} \quad (4)$$

where $S_{input}$ is the total size of the input data and $\omega_{master \to pending}$ is the average throughput for the master to store data into the pool of pending tasks, and

$$t_{drain} = \frac{S_{output}}{\omega_{master \leftarrow completed}} \quad (5)$$

where $S_{output}$ is the total size of the output data and $\omega_{master \leftarrow completed}$ is the average throughput for the master to load data from the pool of completed tasks. From (4) and (5) we know that by increasing $\omega_{master \to pending}$ and $\omega_{master \leftarrow completed}$, $t_{fill}$ and $t_{drain}$ can be shortened.

As for the synchronization time $t_{sync}$, it can be seen as a function of two variables:

$$t_{sync} = F(p, q) \quad (6)$$

where p is the number of shared objects in the system and q is the number of participating processes wishing to access the shared objects. $t_{sync}$ usually grows rapidly as p and q increase.

The master idle time $t_{idle}$ comes into play when both of the following conditions are true: (a) the pool of pending tasks is full, and (b) the pool of completed tasks is empty. The occurrence rate of condition (a) is decided by the pool size, $\omega_{master \to pending}$ and $\omega_{(worker \leftarrow pending)aggregated}$, where the latter represents the aggregated throughput for all workers to load data from the pool of pending tasks. Similarly, the occurrence rate of condition (b) is decided by the pool size, $\omega_{master \leftarrow completed}$ and $\omega_{(worker \to completed)aggregated}$, where the latter represents the aggregated throughput for all workers to store data to the pool of completed tasks. Ideally, $t_{idle}$ can be eliminated altogether with a properly configured pool size and by maintaining:

$$\omega_{(worker \leftarrow pending)aggregated} > \omega_{master \to pending} \quad \text{and} \quad \omega_{(worker \to completed)aggregated} > \omega_{master \leftarrow completed}. \quad (7)$$

The throughput $\omega$ values in (7) are affected by the number of processes in the system and by how data communications are carried out between processes.

3.1. Shared memory schemes

Two shared memory schemes, master share and worker share, are introduced as follows. Communication between two processes using shared memory mechanisms can be achieved by allowing one process to allocate a block of shared memory and then exchanging the address of the shared memory between processes; this means all participating processes in the data communication are able to directly load values from, or store values to, the specified shared memory addresses.

3.1.1. Master share

Fig. 6. Illustration of master share.

In the master share scheme, the master process and worker processes exchange data using shared memory space allocated by the master process. Fig. 6 depicts the initialization of master share, where the master process allocates a region of shared memory to accommodate shared objects. The master process then notifies worker processes of the location of the shared memory, such that both master and worker processes can access the shared memory region.

Under the master share scheme, because the task pool and result pool are memory buffers created and shared by the master process, the memory buffers are homed to the tile running the master process. This means the overheads for memory loads and stores are minimal for the master process, so the memory bandwidths $\omega_{master \to pending}$ and $\omega_{master \leftarrow completed}$ in (7) will be relatively high. However, from the perspective of a worker process, because the shared memory is not homed to the tile running the worker process, the memory overheads become higher. Also, $\omega_{(worker \leftarrow pending)aggregated}$ and $\omega_{(worker \to completed)aggregated}$ in (7) are confined by the memory network bandwidth to the master tile, which causes a performance bottleneck as the number of worker processes increases.
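A minimal initialization sketch of master share, reusing the declarations from the earlier sketches (task_pool_t, ADDR_TAG, and the assumed iLib messaging calls):

/* Master share: the master allocates both pools, so they are homed to
 * the master's tile, then sends their address to every worker. */
void master_share_init(int num_workers)
{
    task_pool_t *pools = (task_pool_t *) malloc_shared(2 * sizeof(task_pool_t));

    for (int w = 1; w <= num_workers; w++)
        ilib_msg_send(ILIB_GROUP_SIBLINGS, w, ADDR_TAG, &pools, sizeof(pools));
}

Every worker then loads from and stores to buffers homed to a single tile, which is exactly the bottleneck described above.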

3.1.2. Worker share

Fig. 7. Illustration of worker share.

In the worker share scheme, each worker process allocates a region of shared memory for data sharing with the master process, as depicted in Fig. 7. The worker process allocates a region of shared memory to accommodate shared objects and, similarly to the above discussion, then notifies the master process of the location of the shared memory, so both the worker and master processes have read and write access to this shared memory region.

Under the worker share scheme, the aggregated bandwidths $\omega_{(worker \leftarrow pending)aggregated}$ and $\omega_{(worker \to completed)aggregated}$ increase, since memory accesses are distributed across all worker tiles, generating more evenly distributed memory network traffic and improving overall MapReduce performance.
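The symmetric sketch for worker share, under the same assumptions: each worker allocates its own buffers, so they are homed to the worker's tile, and only their addresses travel to the master.

/* Worker share: per-worker buffers distribute memory traffic over all
 * worker tiles instead of funneling it through the master's tile. */
void worker_share_init(void)
{
    task_pool_t *my_pools = (task_pool_t *) malloc_shared(2 * sizeof(task_pool_t));

    /* The master (rank 0) gathers one address per worker. */
    ilib_msg_send(ILIB_GROUP_SIBLINGS, 0, ADDR_TAG, &my_pools, sizeof(my_pools));
}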

4. Theoretical analysis

We derive the performance characteristics of the master share and worker share schemes in this section.

4.1. Shared memory access on TILE64

Fig. 8. Memory controller and tile grouping on TILE64.

A TILE64 processor contains 64 cores, each referred to as a tile. Tiles are identified by their coordinates in the 8-by-8 mesh. To denote tiles, we use the notation tile(m, n), where 0 ≤ m < 8 and 0 ≤ n < 8. As Fig. 8 shows, a TILE64 processor has four memory controllers located at the four corners of the chip. The 64 tiles in a TILE64 processor are divided into four groups; each group contains 16 tiles, and every tile in a group shares the same memory controller. Private memory and shared memory within a process are allocated first to the group's memory controller. For example, if a process running on tile(2, 2) allocates a block of shared memory, this block of shared memory will be homed to tile(2, 2) and allocated to MC1.

Assuming memory is not cached, every load and store operation goes directly to the associated memory controller. Consider, for example, the case where tile(0, 0) allocates a block of shared memory and shares this memory block with tile(3, 3). Load and store accesses to memory addresses are translated into network traffic on the on-chip mesh network. On mesh-based networks, dimension-order routing such as XY routing is commonly used. In XY routing, messages sent from a source tile(m, n) to a destination tile(p, q) are first routed along the X dimension to tile(m, q), then along the Y dimension to tile(p, q). This routing algorithm guarantees not only that shortest paths from any source to any destination are selected, but also that routing is deadlock-free.

Fig. 9. Routing path for load and store operations from tile(3, 3) to memory shared by tile(0, 0).

Fig. 9 shows the routing path for load and store operations originating from tile(3, 3) to the shared memory block created by tile(0, 0). For load operations, because the actual data resides in MC0, network messages carrying the data are routed from MC0 to tile(3, 3) through a shortest path, which is

MC0 → tile(0, 3) → tile(1, 3) → tile(2, 3) → tile(3, 3).

Network messages in this load operation travel through 5 switches and 4 intermediate wires. For store operations, because the shared memory is allocated and managed by tile(0, 0) and store operations involve updating values, network messages generated by store operations performed by tile(3, 3) are routed through a shortest path from tile(3, 3) towards tile(0, 0) and on to MC0, which is

tile(3, 3) → tile(3, 2) → tile(3, 1) → tile(3, 0) → tile(2, 0) → tile(1, 0) → tile(0, 0) → MC0.
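The XY rule translates directly into a small helper that enumerates the route between two tiles; this is a plain transcription of the rule above, for illustration only.

#include <stdio.h>

/* Print the XY route from tile(m, n) to tile(p, q): move along the X
 * dimension until the second coordinate matches, then along Y. */
void xy_route(int m, int n, int p, int q)
{
    printf("tile(%d, %d)", m, n);
    while (n != q) {                       /* X dimension first */
        n += (q > n) ? 1 : -1;
        printf(" -> tile(%d, %d)", m, n);
    }
    while (m != p) {                       /* then Y dimension */
        m += (p > m) ? 1 : -1;
        printf(" -> tile(%d, %d)", m, n);
    }
    printf("\n");
}

/* xy_route(3, 3, 0, 0) reproduces the store path above:
 * tile(3, 3) -> tile(3, 2) -> ... -> tile(1, 0) -> tile(0, 0). */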

4.2. Sequential MapReduce performance

The time required for a single tile to perform the MapReduce operation over a given input dataset can be calculated as:

$$t_{mapreduce} = t_{map} + t_{reduce} \quad (8)$$

where $t_{map}$ and $t_{reduce}$ are the times for completing all map and reduce tasks, respectively. Furthermore, the time for map tasks is

$$t_{map} = t_{read\_input} + t_{comp\_map} + t_{write\_itrm} \quad (9)$$

where $t_{read\_input}$ represents the memory access time required to load all data from the input, $t_{comp\_map}$ is the computation time spent on the map function, and $t_{write\_itrm}$ is the time spent storing intermediate data into the intermediate buffer. The time for reduce tasks is

$$t_{reduce} = t_{read\_itrm} + t_{comp\_reduce} + t_{write\_output} \quad (10)$$

where $t_{read\_itrm}$ is the memory access time required to load all intermediate data from the intermediate buffer, $t_{comp\_reduce}$ is the computation time spent on the reduce function, and $t_{write\_output}$ is the time spent storing final results into the output buffer.

The throughput for map tasks can be defined as:

$$\varphi_{map} = \frac{S_{input}}{t_{map}} = \frac{S_{input}}{t_{read\_input} + t_{comp\_map} + t_{write\_itrm}}, \quad (11)$$

and the throughput for reduce tasks can be defined as:

$$\varphi_{reduce} = \frac{S_{itrm}}{t_{reduce}} = \frac{S_{itrm}}{t_{read\_itrm} + t_{comp\_reduce} + t_{write\_output}}. \quad (12)$$

If there is only a single tile running MapReduce, with all the other tiles on a TILE64 idling, the memory access times $t_{read\_input}$, $t_{write\_itrm}$, $t_{read\_itrm}$ and $t_{write\_output}$ in (9) and (10) will not be practically affected by the tile on which MapReduce is executed. This is because the tile-to-tile network latency is designed to be very low. But when many tiles are using the network, contention might occur as the on-chip network becomes very busy, resulting in increased memory access latency.

4.3. Parallel MapReduce performance

With an increased number of worker processes participating in a parallel MapReduce operation, the memory access times $t_{read\_input}$, $t_{write\_itrm}$, $t_{read\_itrm}$ and $t_{write\_output}$ in (9) and (10) for worker processes vary with the physical locations of the tiles. A worker process running on a tile further from a memory controller will have higher memory access latency under high network traffic due to network contention.

Fig. 10. Message routing for memory accesses from different tiles.

Fig. 10 shows an example of message routing on the TILE64 for two tiles, tile(1, 1) and tile(3, 3), which share a memory buffer allocated by tile(0, 0). As discussed in Section 4.1, store operations issued by tile(1, 1) are routed through the path tile(1, 1) → tile(1, 0) → tile(0, 0) → MC0, which overlaps with the path for store operations issued by tile(3, 3) on the network link between tile(1, 0) and tile(0, 0).

Fig. 11. Network contention and bandwidth sharing within a tile.

Assume that the switch in a tile routes messages from each port with equal priority; when the link of an output port is fully utilized, messages received from all input ports share the output bandwidth equally. For example, Fig. 11 shows the scenario where 4 tiles, tile(m, n), tile(m, n−1), tile(m, n+1) and tile(m+1, n), are all sending messages to tile(m−1, n) through the on-chip switch of tile(m, n). In such a situation, if the maximum bandwidth of the link from the switch to tile(m−1, n) is λ, then the average message rate from each source would be λ/4.

4.3.1. Master share

If the master process is running on tile(0, 0), the memory bandwidth for store operations of a worker process on tile(m, n) will be

$$\frac{\lambda}{3^{m+1} \times 2^{n}}, \quad (13)$$

the memory bandwidth for load operations will be

$$\begin{cases} \dfrac{\lambda}{2^{m+1}} & \text{if } n \le 2 \\ \dfrac{\lambda}{3^{m+1} \times 2^{n-3}} & \text{if } n \ge 3, \end{cases} \quad (14)$$

and the aggregated map throughput for w workers is

$$\sum_{i=0}^{w} \varphi_{map}(i). \quad (15)$$

The variables w and i in (15) represent the number of worker processes and the worker process id, respectively. Thus we have

$$\varphi_{map}(i) = \varphi_{map} \times \tau = \frac{S_{input}}{t_{map}(i)} = \frac{S_{input}}{t_{map}} \times \tau, \qquad t_{map}(i) = \frac{t_{map}}{\tau},$$

$$\tau = \frac{t_{map}}{t_{map}(i)} = \frac{t_{read\_input} + t_{comp\_map} + t_{write\_itrm}}{p \times t_{read\_input} + t_{comp\_map} + q \times t_{write\_itrm}} \quad (16)$$

where

$$p = \begin{cases} 2^{\lfloor i/8 \rfloor + 1} & \text{if } i\%8 \le 2 \\ 3^{\lfloor i/8 \rfloor + 1} \times 2^{(i\%8) - 3} & \text{if } i\%8 \ge 3, \end{cases} \qquad q = 3^{\lfloor i/8 \rfloor + 1} \times 2^{i\%8}. \quad (17)$$
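Equation (17) translates directly into code; a sketch of the factors for worker i, assuming, as the formulas do, that worker i sits in row ⌊i/8⌋ and column i%8:

#include <math.h>

/* Bandwidth-reduction factors of Eq. (17) for worker i under master
 * share: p scales load time and q scales store time in Eq. (16). */
void master_share_factors(int i, double *p, double *q)
{
    int row = i / 8;                 /* floor(i/8) */
    int col = i % 8;                 /* i mod 8 */

    if (col <= 2)
        *p = pow(2, row + 1);
    else
        *p = pow(3, row + 1) * pow(2, col - 3);

    *q = pow(3, row + 1) * pow(2, col);
}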

Similarly, the aggregated reduce throughput for w workers is:

$$\sum_{i=0}^{w} \varphi_{reduce}(i) \quad (18)$$

and

$$\varphi_{reduce}(i) = \varphi_{reduce} \times \tau, \quad (19)$$

$$\tau = \frac{t_{reduce}}{t_{reduce}(i)} = \frac{t_{read\_itrm} + t_{comp\_reduce} + t_{write\_output}}{p \times t_{read\_itrm} + t_{comp\_reduce} + q \times t_{write\_output}} \quad (20)$$

where p and q are derived from (17).

4.3.2. Worker share

Although worker share is harder to implement than master share, when properly implemented every worker allocates its own blocks of shared memory buffers for storage of input data, intermediate data and output data. In this way, a worker uses memory spaces allocated by itself, so the memory bandwidth for both store and load operations of a worker process on tile(m, n) will be

$$\begin{cases} \dfrac{\lambda}{2^{m+1}} & \text{if } 0 \le m \le 3 \\ \dfrac{\lambda}{2^{8-m}} & \text{if } 4 \le m \le 7. \end{cases} \quad (21)$$

Thus, to calculate $\varphi_{map}(i)$ in (16) and $\varphi_{reduce}(i)$ in (19), the p and q values will be

$$p = q = \begin{cases} 2^{\lfloor i/8 \rfloor + 1} & \text{if } 0 \le \lfloor i/8 \rfloor \le 3 \\ 2^{8 - \lfloor i/8 \rfloor} & \text{if } 4 \le \lfloor i/8 \rfloor \le 7. \end{cases} \quad (22)$$
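The worker share counterpart of the factor helper, transcribing Eq. (22) under the same conventions (and math.h include) as the previous sketch:

/* Bandwidth-reduction factor of Eq. (22) for worker i under worker
 * share; the same value scales both load and store time. */
double worker_share_factor(int i)
{
    int row = i / 8;                 /* floor(i/8) */

    return (row <= 3) ? pow(2, row + 1) : pow(2, 8 - row);
}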

4.4. Theoretical performance

To derive the theoretical performance of the master share and worker share implementations, we first cross-compile the Phoenix [17] MapReduce implementation onto the TILE64 platform. Then we profile the benchmarks included with Phoenix on a single tile to obtain the timing and data size information for Eqs. (11) and (12). This allows us to calculate $\sum_{i=0}^{w} \varphi_{map}(i)$ and $\sum_{i=0}^{w} \varphi_{reduce}(i)$ by going through Eqs. (13)–(22).
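Combining the profiled single-tile times with the factor helpers sketched earlier, the aggregated map throughput of Eq. (15) can be evaluated numerically; in the sketch below, the s_input and t_* arguments are the profiled values.

/* Aggregated map throughput of Eq. (15) under master share, computed
 * from single-tile profile data via Eqs. (9), (16) and (17). */
double aggregated_map_throughput(int w, double s_input, double t_read,
                                 double t_comp, double t_write)
{
    double t_map = t_read + t_comp + t_write;      /* Eq. (9) */
    double sum = 0.0;

    for (int i = 0; i <= w; i++) {
        double p, q;
        master_share_factors(i, &p, &q);           /* Eq. (17) */
        double tau = t_map / (p * t_read + t_comp + q * t_write); /* Eq. (16) */
        sum += (s_input / t_map) * tau;            /* phi_map(i) */
    }
    return sum;
}

Swapping in worker_share_factor(i) for both p and q gives the worker share curve; the reduce throughput of Eq. (18) is computed the same way from Eqs. (10) and (20).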

4.4.1. Benchmark applications

There are 8 benchmarks included with the Phoenix MapReduce library. These benchmarks represent key computations from various application domains. Word Count, Reverse Index and String Match are for enterprise computing. Matrix Multiply is for scientific computing. KMeans, PCA and Linear Regression are for artificial intelligence. Histogram is for image processing. Following are brief introductions to each benchmark application.

Word Count: The input of Word Count is a text file. It determines the frequency of words in the input file. In the Map stage, workers process different sections of the input file and return intermediate data consisting of a word (key) and a value of 1 each time the word is found. In the Reduce stage, workers add up the values for each word (key) to obtain the occurrence frequency of each word in the input file. A 10 MB text file is used as input for the smaller problem size and a 100 MB text file is used as input for the larger problem size.
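A schematic C version of the Word Count callbacks conveys the shape of the computation; emit_intermediate() is a hypothetical emitter for key-value pairs, in the spirit of Phoenix, not the paper's actual interface.

#include <string.h>

void emit_intermediate(const char *key, int val);   /* hypothetical emitter */

/* Map: scan one section of the input and emit (word, 1) per word found. */
void wc_map(char *split)
{
    for (char *w = strtok(split, " \t\n"); w != NULL; w = strtok(NULL, " \t\n"))
        emit_intermediate(w, 1);
}

/* Reduce: sum the 1s for one word to obtain its occurrence frequency. */
int wc_reduce(const char *key, const int *vals, size_t nvals)
{
    (void)key;
    int count = 0;
    for (size_t i = 0; i < nvals; i++)
        count += vals[i];
    return count;
}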

Histogram: The input of Histogram is a bitmap image file. It analyzes the input image to compute the frequency of occurrence of each value in the 0–255 range for the RGB components of all pixels. Similar to Word Count, in the Map stage, workers process different portions of the image to parse it and insert the frequencies of component occurrences into the intermediate data buffer array. In the Reduce stage, workers sum up these occurrence numbers across all portions. A 100 MB bitmap image file is used as input for the smaller problem size and a 400 MB bitmap image file is used as input for the larger problem size.

Reverse Index: The input is a set of HTML files. This application extracts all hyperlinks in the files and generates an index from each unique hyperlink to its associated file names. In the Map stage, workers parse disjoint subsets of the input HTML files to find hyperlinks. If a hyperlink is found, the worker outputs an intermediate pair with the link as the key and the file name as the value. In the Reduce stage, all files referencing the same link are combined into a single linked list. The smaller problem set contains an HTML file set of around 250 KB; the larger problem set contains an HTML file set of around 1 GB.

String Match: It processes two files: ‘‘encrypt’’ and ‘‘keys’’. The ‘‘encrypt’’ file contains a set of encrypted words and the ‘‘keys’’ file contains a list of plain text words. This application encrypts all words in the ‘‘keys’’ file in order to find which plain text words were used to generate the ‘‘encrypt’’ file. In the Map stage, workers process different portions of the ‘‘keys’’ file and return the plain text word as the key and, as the value, a flag indicating whether the plain text word is a match. There is no actual computation task in the Reduce stage, so the Reduce task is just an identity function. The sizes of the smaller and larger input ‘‘keys’’ files are 50 and 500 MB, respectively.

PCA: This application performs a portion of the Principal Component Analysis algorithm in order to find the mean vector and the covariance matrix of a set of data points. The data is presented in a matrix as a collection of column vectors. The algorithm uses two MapReduce iterations: the first computes the mean for a set of rows, and the second computes a few elements of the required covariance matrix. The Reduce task is the identity in both iterations. The input matrix size is 100 × 100 for the smaller problem set and 1000 × 1000 for the larger problem set.

KMeans: This application utilizes the KMeans iterative clustering algorithm to group a set of input data points into clusters. The MapReduce function is executed iteratively until the algorithm converges. Workers in the Map stage process subsets of the data points, finding the distance between each point and each mean to assign the point to the closest cluster. In the Reduce stage, workers gather all points with the same cluster-id and calculate their mean vector. The input size is 100K data points for the smaller problem set and 500K data points for the larger problem set.

Linear Regression: This application computes the line that best fits a given set of coordinates in an input file. In the Map stage, workers process different portions of the input file to compute summary statistics. In the Reduce stage, the statistics are combined across the entire dataset to finally determine the best-fit line. The smaller problem set is a 50 MB file and the larger problem set is a 500 MB file containing coordinates.

Matrix Multiply: In the Map stage, workers compute subsets of rows of the output matrix and return the (x, y) location of each element as the key and the result of the computation as the value. The Reduce task is just the identity function. The smaller problem set is two 300 × 300 matrices and the larger problem set is two 600 × 600 matrices.

Fig. 12. Theoretical performance of Word Count for smaller problem size.

Fig. 13. Theoretical performance of Word Count for larger problem size.

Fig. 14. Theoretical performance of Histogram for smaller problem size.

Fig. 15. Theoretical performance of Histogram for larger problem size.

Fig. 16. Theoretical performance of Reverse Index for smaller problem size.

4.4.2. Analysis results

Following are the theoretical results of master–worker MapReduce on TILE64 for different benchmarks with different problem sizes. The theoretical performance for the 8 MapReduce benchmarks from Phoenix, Word Count, Histogram, Reverse Index, String Match, PCA, KMeans, Linear Regression, and Matrix Multiply, is shown in Figs. 12–27. From the theoretical results, we can see that in all cases the worker share scheme is superior to the master share scheme in terms of both scalability and performance. Different benchmarks have different characteristics, as shown in the figures.


Fig. 17. Theoretical performance of Reverse Index for larger problem size.

Fig. 18. Theoretical performance of String Match for smaller problem size.

Fig. 19. Theoretical performance of String Match for larger problem size.

Fig. 20. Theoretical performance of PCA for smaller problem size.

Fig. 21. Theoretical performance of PCA for larger problem size.

Fig. 22. Theoretical performance of KMeans for smaller problem size.

Fig. 23. Theoretical performance of KMeans for larger problem size.

Fig. 24. Theoretical performance of Linear Regression for smaller problem size.

Most benchmarks spend the majority of total execution time in the Map stage. It can also be observed that a change of problem size might change the time proportion of Map to Reduce. With an increasing number of worker processes, both Map and Reduce times can be shortened; this also varies by benchmark and problem size. Fig. 28 shows the theoretical maximum speedup for 64 workers. From Fig. 28, we can see that for Word Count, Reverse Index, PCA and Matrix Multiply, the worker share scheme improves speedup by a large amount. This is because these applications demand higher aggregated memory bandwidth. On the other hand, the performance improvement of worker share over master share for Histogram, KMeans and Linear Regression is relatively small. This is because these applications spend more time on computation than on I/O, so memory contention problems are less likely to occur in these applications.

5. Conclusion and future work

Fig. 25. Theoretical performance of Linear Regression for larger problem size.

Fig. 26. Theoretical performance of Matrix Multiply for smaller problem size.

Fig. 27. Theoretical performance of Matrix Multiply for larger problem size.

Fig. 28. Theoretical maximum speedup for 64 workers.

New generations of multi-core and many-core processors bring higher performance within the same or a lower power envelope. This advantage comes at the price of complicating application design and programming. Therefore, this study explores shared memory programming schemes for master–worker MapReduce processing on the TILE64 many-core platform. We model and compare two shared memory implementation schemes, master share and worker share. Analysis shows that the worker share scheme is superior to the master share scheme.

As a further plan for the development of this research, we intend to implement a MapReduce library on the TILE64 platform incorporating both the master share and worker share schemes, and to run MapReduce benchmarks to verify that the experimental results match the theoretical analysis. We would also like to further explore this topic by applying master share and worker share to more complex paradigms such as hierarchical master–worker structures.

References

[1] S. Borkar, Thousand core chips: a technology perspective, in: Proceedings of the 44th Design Automation Conf., DAC 07, 2007, pp. 746–749. http://dx.doi.org/10.1145/1278480.1278667.

[2] J. Parkhurst, J. Darringer, B. Grundmann, From single core to multi-core: preparing for a new exponential, in: Proceedings of the IEEE/ACM Int. Conf. Computer-Aided Design, ICCAD 06, 2006, pp. 67–72. http://dx.doi.org/10.1145/1233501.1233516.

[3] L. Karam, I. AlKamal, A. Gatherer, G. Frantz, D. Anderson, B. Evans, Trends in multicore DSP platforms, IEEE Signal Process. Mag. 26 (6) (2009) 38–49. http://dx.doi.org/10.1109/MSP.2009.934113.

[4] H. Sutter, The free lunch is over: a fundamental turn toward concurrency in software, Dr. Dobb's J. 30 (3) (2005) 202–210.

[5] G. Chen, F. Li, S.W. Son, M. Kandemir, Application mapping for chip multiprocessors, in: Proceedings of the 45th Design Automation Conf., DAC 08, 2008, pp. 620–625. http://dx.doi.org/10.1145/1391469.1391628.

[6] G. Tan, N. Sun, G.R. Gao, A parallel dynamic programming algorithm on a multi-core architecture, in: Proceedings of the 19th ACM Symp. Parallel Algorithms and Architectures, SPAA 07, 2007, pp. 135–144. http://dx.doi.org/10.1145/1248377.1248399.

[7] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, B. Liewei, J. Brown, M. Mattina, M. Chyi-Chang, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, J. Zook, TILE64 processor: a 64-core SoC with mesh interconnect, in: Proceedings of the IEEE Intl. Solid-State Circuits Conf., ISSCC 08, 2008, pp. 88–598. http://dx.doi.org/10.1109/ISSCC.2008.4523070.

[8] J. Berthold, M. Dieterle, R. Loogen, S. Priebe, Hierarchical master–worker skeletons, in: Proceedings of the 10th Int. Conf. Practical Aspects of Declarative Languages, PADL 08, in: LNCS, 2008, pp. 248–264.

[9] A. Benoit, L. Marchal, J.F. Pineau, Y. Robert, F. Vivien, Scheduling concurrent bag-of-tasks applications on heterogeneous platforms, IEEE Trans. Comput. 59 (2) (2010) 202–217. http://dx.doi.org/10.1109/TC.2009.117.

[10] J.D. Jackson, P.J. Hatcher, Efficient parallel execution of sequence similarity analysis via dynamic load balancing, in: Proceedings of the ISCA 3rd Int. Conf. Bioinformatics and Computational Biology, BICoB 11, 2011, pp. 219–224.

[11] J.P. Goux, S. Kulkarni, J. Linderoth, M. Yoder, An enabling framework for master–worker applications on the computational grid, in: Proceedings of the 9th Int. Symp. High-Performance Distributed Computing, HPDC 00, 2000, pp. 43–50.

[12] R.M. Fujimoto, A.W. Malik, A. Park, Parallel and distributed simulation in the cloud, SCS M&S Magazine 1 (3) (2010) 1–10.

[13] M. Rynge, S. Callaghan, E. Deelman, G. Juve, G. Mehta, K. Vahi, P.J. Maechling, Enabling large-scale scientific workflows on petascale resources using MPI master/worker, in: Proceedings of the 1st Conf. Extreme Science and Engineering Discovery Environment, XSEDE 12, 2012, pp. 1–8. http://dx.doi.org/10.1145/2335755.2335846.

[14] F. Blagojevic, D.S. Nikolopoulos, A. Stamatakis, C.D. Antonopoulos, Dynamic multigrain parallelization on the cell broadband engine, in: Proceedings of the 12th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, 2007, pp. 90–100. http://dx.doi.org/10.1145/1229428.1229445.

[15] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113. http://dx.doi.org/10.1145/1327452.1327492.

[16] A. Bialecki, M. Cafarella, D. Cutting, O. O'Malley, Hadoop: a framework for running applications on large clusters built of commodity hardware. Wiki at http://wiki.apache.org/hadoop/ 11, 2005.

[17] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis, Evaluating MapReduce for multi-core and multiprocessor systems, in: Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, HPCA 2007, 2007, pp. 13–24. http://dx.doi.org/10.1109/HPCA.2007.346181.

[18] R. Chen, H. Chen, B. Zang, Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling, in: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, Vienna, Austria, 2010.

[19] H. Hoffmann, D. Wentzlaff, A. Agarwal, Remote store programming, in: Y. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, X. Martorell (Eds.), High Performance Embedded Architectures and Compilers, in: LNCS, vol. 5952, Springer-Verlag, 2010, pp. 3–17. http://dx.doi.org/10.1007/978-3-642-11515-8_3.

[20] M. Awasthi, D.W. Nellans, K. Sudan, R. Balasubramonian, A. Davis, Handling the problems and opportunities posed by multiple on-chip memory controllers, in: Proceedings of the 19th Int. Conf. Parallel Architectures and Compilation Techniques, PACT 10, 2010, pp. 319–330. http://dx.doi.org/10.1145/1854273.1854314.

[21] D. Abts, N.D.E. Jerger, J. Kim, D. Gibson, M.H. Lipasti, Achieving predictable performance through better memory controller placement in many-core CMPs, in: Proceedings of the 36th Int. Symp. Computer Architecture, ISCA 09, 2009, pp. 451–461. http://dx.doi.org/10.1145/1555754.1555810.

[22] X.-Y. Lin, C.-Y. Huang, P.-M. Yang, T.-W. Lung, S.-Y. Tseng, Y.-C. Chung, Parallelization of motion JPEG decoder on TILE64 many-core platform, in: C.-H. Hsu, V. Malyshkin (Eds.), Methods and Tools of Parallel Programming Multicomputers, in: LNCS, vol. 6083, Springer-Verlag, 2011, pp. 59–68. http://dx.doi.org/10.1007/978-3-642-14822-4_7.

Xuan-Yi Lin received his B.S. degree in Computer Science and Information Engineering from Da-Yeh University in 2001, and the M.S. degree in Information Engineering and Computer Science from Feng Chia University in 2003. He is currently a Ph.D. student in the Department of Computer Science, National Tsing Hua University, Taiwan. His research interests include cluster systems, cloud computing and many-core platforms. He is a student member of the IEEE Computer Society and ACM.

Yeh-Ching Chung received a B.S. degree in Information Engineering from Chung Yuan Christian University in 1983, and the M.S. and Ph.D. degrees in Computer and Information Science from Syracuse University in 1988 and 1992, respectively. He joined the Department of Information Engineering at Feng Chia University as an associate professor in 1992 and became a full professor in 1999. From 1998 to 2001, he was the chairman of the department. In 2002, he joined the Department of Computer Science at National Tsing Hua University as a full professor. His research interests include parallel and distributed processing, cluster systems, grid computing, multi-core tool chain design, and multi-core embedded systems. He is a member of the IEEE Computer Society and ACM.