Oversubscription on Multicore Processors

Costin Iancu, Steven Hofmeyr, Filip Blagojevic, Yili Zheng
Lawrence Berkeley National Laboratory

{cciancu, shofmeyr, fblagojevic, yzheng}@lbl.gov

Abstract: Existing multicore systems already provide deep levels of thread parallelism. Hybrid programming models and composability of parallel libraries are very active areas of research within the scientific programming community. As more applications and libraries become parallel, scenarios where multiple threads compete for a core are unavoidable. In this paper we evaluate the impact of task oversubscription on the performance of MPI, OpenMP and UPC implementations of the NAS Parallel Benchmarks on UMA and NUMA multi-socket architectures. We evaluate explicit thread affinity management against the default Linux load balancing and discuss sharing and partitioning system management techniques. Our results indicate that oversubscription provides beneficial effects for applications running in competitive environments. Sharing all the available cores between applications provides better throughput than explicit partitioning. Modest levels of oversubscription improve system throughput by 27% and provide better performance isolation of applications from their co-runners: best overall throughput is always observed when applications share cores and each is executed with multiple threads per core. Rather than "resource" symbiosis, our results indicate that the determining behavioral factor when applications share a system is the granularity of the synchronization operations.

1. Introduction

The pervasiveness of multicore processors in contemporary computing systems will increase the demand for techniques to adapt application level parallelism to the available hardware parallelism. Modern systems are increasingly asymmetric in terms of both architecture (e.g. Intel Larrabee or GPUs) and performance (e.g. Intel Nehalem Turbo Boost). Parallel applications often have restrictions on the degree of parallelism and threads might not be evenly distributed across cores, e.g. a square number of threads. Furthermore, applications using hybrid programming models and concurrent execution of parallel libraries are likely to become more prevalent. Recent results illustrate the benefits of hybrid MPI+OpenMP [7, 13] or MPI+UPC [4] programming for scientific applications in multicore environments. In this setting MPI processes share cores with other threads. Calls to parallel scientific libraries are also likely to occur [3] in consumer applications that are executed in competitive environments on devices where tasks are forced to share cores. As more and more applications and libraries become parallel, they will have to execute in asymmetric (either hardware or load imbalanced) competitive environments, and one question is how well existing programming models and OS support behave in these situations. Existing runtimes for parallel scientific computing are designed on the implicit assumption that applications run with one Operating System level task per core in a dedicated execution environment and that this setting will provide the best performance.

In this paper we evaluate the impact of oversubscription on end-to-end application performance, i.e. running an application with a number of OS level tasks larger than the number of available cores. We explore single and multi-application usage scenarios with system partitioning or sharing for several implementations of the NAS Parallel Benchmarks [20] (UPC, MPI+Fortran and OpenMP+Fortran) on multi-socket NUMA (AMD Barcelona and Intel Nehalem) and UMA (Intel Tigerton) multicore systems running Linux. To our knowledge, this is the first evaluation of oversubscription, sharing and partitioning on non-simulated hardware at the core concurrency levels available today. Our results indicate that while task oversubscription sometimes affects performance in dedicated environments, it is beneficial for the performance of parallel workloads in competitive environments, irrespective of the programming model. Modest oversubscription (2, 4, 8 tasks per core) improves system throughput by 27% when compared to running applications in isolation, and in some cases end-to-end performance (up to 46%). For the multi-socket systems examined, partitioning sockets between applications in a combined workload almost halves the system throughput, regardless of the programming paradigm. Partitioning [17, 21, 26] has been increasingly advocated recently and this particular result provides a cautionary tale when considering scientific workloads. Oversubscription provides an easy way to reduce the impact of co-runners on the performance of parallel applications and it is also a feasible technique to allocate resources to applications: there is a correlation between the number of threads present in the system and the observed overall performance.

When examining an isolated application, our results indicate that the average inter-barrier interval is a good predictor of its behavior with oversubscription. Fine grained applications (inter-barrier intervals of a few ms) are likely to see performance degradation, while coarser grained applications speed up or are not affected. The observed behavior is independent of the architecture and the programming model. When examining co-scheduling of applications, again the synchronization granularity is an accurate predictor of their behavior. The results presented in this study suggest that symbiotic scheduling of parallel scientific applications on multicore systems depends on the synchronization behavior, rather than on cache or CPU usage.

One of the more surprising results of this study is that, irrespective of the programming model, best overall system throughput is obtained when each application is allowed to run with multiple threads per core in a non-dedicated environment. The throughput improvements provided by oversubscription in a "symbiosis" agnostic manner match or exceed the improvements reported by existing studies of co-scheduling on multicore or shared memory systems. As discussed in Section 8, our evaluation of OpenMP oversubscription exhibits different trends than previously reported. Our results are obtained only when enforcing an even thread distribution across cores at program startup; oversubscription degrades performance in all cases where the default Linux load balancing is used. For the applications considered, an even initial task distribution seems to be the only requirement for good performance, rather than careful thread to core mappings. The performance results for experiments where threads are randomly pinned to cores are statistically indistinguishable.

2. Oversubscription and Performance

When threads share cores, several factors impact end-to-end application performance: 1) context switching; 2) load balancing; 3) inter-thread synchronization overhead; and 4) system partitioning. The impact of context switching is determined by the direct OS overhead of scheduling and performing the task switching and by the indirect overhead of lost hardware state (locality): caches and TLBs. Li et al [14] present microbenchmark results to quantify the impact of task migration on modern multicore processors. Their results indicate overheads with granularities ranging from a few microseconds for CPU intensive tasks to a few milliseconds for memory intensive tasks; the scheduling time quantum is around 100ms. Our application results also indicate that context switching and loss of hardware state do not significantly impact performance for the observed applications; therefore, in the rest of this paper we do not quantify these metrics.

Load balancing determines the "spatial" distribution of tasks on the available cores and sometimes tries to address locality concerns. Recent studies advocate explicit thread affinity management [18] using the sched_setaffinity system call and better system load balancing mechanisms [11, 15]. Explicitly managing thread affinity can lead to non-portable implementations since it cannot accommodate uneven task distributions. In [11] we present a user level load balancer able to provide scalable performance for arbitrary task distributions; those results also substantiate the fact that the impact of "hardware" locality can be ignored. For conciseness, in this paper we present results for experiments performed with even task distributions across cores and concentrate the exposition on the impact of synchronization and system partitioning.

Parallel applications have data and control dependencies and threads will "stall" waiting for other threads. In these cases oversubscription has the potential to improve application performance with better load balancing and CPU utilization. On the other hand, increasing the number of threads has the potential to increase the duration of synchronization operations due to greater operational complexity and hardware contention. The implementation of these operations often interacts with the per-core task scheduler through the sched_yield system call, which influences the temporal ordering of task execution. In order to increase CPU utilization and system responsiveness, most implementations use a combination of polling and task yielding rather than blocking operations. Similar techniques are used in communication libraries when checking for completion of outstanding operations.
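The following is a minimal sketch of the poll-and-yield pattern described above; the function name and the completion flag are illustrative and are not taken from any of the runtimes studied here.

```c
#include <sched.h>
#include <stdatomic.h>

/* Spin on a completion flag, yielding the core to other runnable tasks
 * instead of blocking in the kernel. This mirrors the poll-and-yield
 * strategy used inside barriers and when testing outstanding
 * communication operations. */
void wait_for_flag(atomic_int *flag)
{
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        /* With oversubscription, sched_yield() lets a co-scheduled
         * thread on the same core make progress; how quickly the
         * caller runs again depends on the kernel's yield semantics
         * (see Section 5.1). */
        sched_yield();
    }
}
```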

System partitioning answers the question of how multicore systems should be managed: threads either competitively share a set of cores or are isolated into their own runtime partitions. Several research efforts [17, 21, 25, 26] advocate system partitioning as a way of improving both performance and isolation.

We evaluate the performance of three programming models in the presence of oversubscription: MPI, UPC and OpenMP. MPI is the dominant programming paradigm for parallel scientific applications, while UPC [23] belongs to the emerging family of Partitioned Global Address Space languages. MPI uses OS processes while UPC uses pthreads and exhibits a lower context switch cost. Both UPC and MPI programs perform inter-task synchronization in barrier operations, while MPI programs might also synchronize in Send/Recv pairs. OpenMP runs with pthreads and performs barrier synchronization when exiting parallel regions and might synchronize inside parallel regions. The three runtimes considered yield inside synchronization operations in the presence of oversubscription. Hybrid implementations using combinations of these programming models are already encountered in practice [4, 7, 13].

3. Experimental Setup

We experimented on the multicore architectures shown in Table 1. The Tigerton system is a UMA quad-socket, quad-core Intel Xeon where each pair of cores shares an L2 cache and each socket shares a front-side bus. The Barcelona system is a NUMA quad-socket, quad-core AMD Opteron where cores within a socket share an L3 cache. The Nehalem system is a dual-socket, quad-core Intel Nehalem system with hyperthreading; each core supports two hardware execution contexts. All systems run recent Linux kernels (2.6.28 on the Tigerton, 2.6.27 on the Barcelona and 2.6.30 on the Nehalem).

We use implementations of the NAS Parallel Benchmarks (NPB), classes S, A, B and C: version 2.4 for UPC [10] and 3.3 for OpenMP and MPI [20]. All programs (CG, MG, IS, FT, EP, SP) have been compiled with the Intel 10.1 compilers (icc, ifort), which provide good serial performance on both architectures. The UPC benchmarks are compiled with the Berkeley UPC 2.8.0 compiler, which uses -O3 for the icc backend compiler, while the Fortran benchmarks were compiled with -fast, which includes -O3. Unless specified otherwise, OpenMP is compiled and executed with static scheduling. We have used MPICH2 on all architectures. Due to space restrictions we do not discuss the details of the NAS benchmarks (for a detailed discussion see [20]).

System    | Processor        | Clock (GHz) | Cores      | L1 data/instr | L2 cache     | L3 cache    | Memory/core | NUMA
Tigerton  | Intel Xeon E7310 | 1.6         | 16 (4x4)   | 32K/32K       | 4M / 2 cores | none        | 2GB         | no
Barcelona | AMD Opteron 8350 | 2           | 16 (4x4)   | 64K/64K       | 512K / core  | 2M / socket | 4GB         | socket
Nehalem   | Intel Xeon E5530 | 2.4         | 16 (2x4x2) | 32K/32K       | 256K / core  | 8M / socket | 1.5G / core | socket

Table 1. Test systems.

[Figure 1: bar chart; y-axis is time in microseconds, x-axis shows 1, 2 and 4 threads per core for UPC, OpenMP and MPI on AMD Barcelona.]
Figure 1. Barrier performance with oversubscription at different core counts (legend) on AMD Barcelona. Similar results are observed on all systems.

[Figure 2: bar chart "UPC NPB 2.4 Barrier Stats, 16 threads"; y-axis is inter-barrier time in ms (log scale), per benchmark (bt, sp, mg, is, ft, ep, cg) and class (A, B, C).]
Figure 2. Average time between two barriers and the barrier count for the UPC benchmarks on Nehalem. Similar trends are observed on all systems and implementations.

The execution time across all benchmarks ranges from a few seconds to hundreds of seconds, while the memory footprints range from a few MB to GB. For example, the data domain in FT class C is a grid of 512x512x512 complex data points: this amounts to more than 2GB of data. Thus, we have a reasonable sample of short and long lived applications and a sample of small and large memory footprints. For all benchmarks, we compare executions using the default Linux load balancing with explicit thread affinity management, referred to as PIN. For PIN we generate random initial pinnings to capture different initial conditions and thread interactions and pin tasks to cores using the sched_setaffinity system call at the beginning of program execution. To capture variation in execution times, each experiment has been repeated five or more times; most experiments have at least 15 settings for the initial task layout. As a performance metric we use the average benchmark running time across all repetitions.
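A minimal sketch of this pinning step, assuming one call per task with a core index taken from a randomly generated task-to-core mapping (the helper name is illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling task (thread or process) to a single core at startup.
 * The PIN experiments perform something equivalent to this once per
 * task, using a randomly generated task-to-core mapping. */
int pin_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```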

4. Benchmark Characteristics

Figure 1 presents the behavior of the barrier implementations for the three programming models in the presence of oversubscription: increasing the number of threads per core increases the barrier latency from a few µs to tens of µs. The MPI oversubscribed barrier latency is greater than that of UPC and OpenMP due to the more expensive process context switch. Note that these microbenchmark results provide a lower bound for the barrier latency when used in application settings. UPC and MPI call sched_yield inside barriers when oversubscribed. The Intel OpenMP runtime provides a tunable implementation controlled by the KMP_BLOCKTIME environment variable. Unless specified otherwise, all results use the default behavior of polling for 200 ms before threads sleep. The other relevant settings are KMP_BLOCKTIME=0, where threads sleep immediately, which is designed for sharing the system with other applications, and KMP_BLOCKTIME=infinite, where threads never sleep, which is designed for dedicated system use. The barrier results with the default setting are representative of the other two settings.
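A sketch of the kind of microbenchmark behind Figure 1, assuming POSIX threads and a blocking pthread barrier rather than the runtimes' own poll-and-yield barriers; the thread count, iteration count and the omitted pinning step are illustrative, not the authors' harness:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 32          /* e.g. 2 threads per core on 16 cores */
#define ITERS    10000

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    (void)arg;               /* pinning omitted for brevity */
    for (int i = 0; i < ITERS; i++)
        pthread_barrier_wait(&barrier);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec t0, t1;

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* thread creation overhead is amortized over ITERS iterations */
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("average barrier latency: %.2f us\n", sec / ITERS * 1e6);
    return 0;
}
```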

Figure 2 presents the synchronization behavior of the UPC implementations on Nehalem: the height of the bars indicates the average time between two barriers in ms, while the labels show the number of executed barriers. The OpenMP and MPI results are very similar and are omitted for brevity. Profiling data shows that most benchmarks exhibit a multi-modal inter-barrier interval distribution; however, as our results indicate, the average interval is a robust predictor of the behavior with oversubscription. We also omit the data about application memory footprints (see [20] for more details).

All implementations exhibit good load balance: the UPC and MPI implementations have been developed for clusters and have an even domain decomposition, and the OpenMP implementations distribute loops evenly across threads.

5. Scalability and Oversubscription

In this section we discuss the effects of oversubscription (up to eight tasks per core) on end-to-end benchmark performance in a dedicated environment: each benchmark is run by itself. All benchmarks in the workload (classes A, B, C) scale up to 16 cores on all systems. Figures 3, 4, 7 and 8 present selected results for all workloads. For each benchmark we present performance normalized to the performance of the experiment with one thread per core: values greater than one indicate performance improvements. The total height of the bars indicates the behavior when tasks are evenly distributed and explicitly pinned to cores at startup.

The UPC workload is not affected by oversubscription. The average performance of the whole workload decreases or increases by −2% and 2% respectively, depending on the number of threads per core. We observe several types of behavior when considering individual benchmarks. EP, which is computationally intensive and very coarse grained, is oblivious to oversubscription. This also indicates that the OS overhead for managing the increased thread parallelism is not prohibitive. Oversubscription improves performance for FT and IS, with a maximum improvement of 46%. The performance improvements increase with the degree of oversubscription. As the problem size increases, the synchronization granularity of SP and MG also increases and oversubscription is able to provide better performance; increasing the degree of oversubscription enhances the respective trend. CG performance decreases proportionally with the degree of oversubscription, with a maximum slowdown of 44%. Note that all benchmarks where performance degrades with oversubscription are characterized by a short inter-barrier interval (e.g. 1-5 ms on Nehalem) and a large number of barrier operations.

The MPI workload is affected more by oversubscription and overall workload performance decreases by 10% when applications are run with two threads per core. Again, most benchmarks are oblivious to oversubscription and performance degradation is observed only for the very fine grained benchmarks. The UPC and MPI implementations use a similar domain decomposition for each benchmark, and some of the differences from the UPC workload could be attributed both to the higher task context switch cost (MPI uses processes, while UPC uses pthreads) and to the higher overhead of the MPI barriers with oversubscription. We isolate the impact of context switching by running¹ the UPC workload with processes and shared memory inter-process communication mechanisms. The behavior of the UPC workload remains almost identical when running with processes; therefore we attribute any different trends in the MPI behavior to the barrier performance.

The OpenMP behavior on Nehalem is presented in Figure 8; similar behavior is observed on the other architectures. Oversubscription slightly decreases overall (classes A, B, C) throughput, again due to the decrease in performance for the fine grained benchmarks. For reference, we also present the behavior of class S. For this problem size the base implementations scale poorly with the increase in cores even when executing with one thread per core. The results presented are with the default settings of KMP_BLOCKTIME and OMP_STATIC. The OpenMP runtime concurrency and scheduling can be changed using the environment variables OMP_DYNAMIC and OMP_GUIDED. We have experimented with these settings, but best performance for this implementation of the NAS benchmarks is obtained with OMP_STATIC.

¹ This Berkeley UPC runtime will soon be made available.

[Figures 3-10 show per-benchmark performance bars; Figures 3, 4, 7 and 8 distinguish the CFS yield, POSIX yield (PSX) and PIN configurations. Only the captions are reproduced here.]

Figure 3. UMA oversubscription, UPC (Tigerton). Performance is normalized to that of experiments with 1 task per core. The number of tasks per core can be 2, 4 or 8. SP requires a square number of threads. Overall workload performance varies from -2% to 2%.

Figure 4. NUMA oversubscription, UPC (Barcelona). Performance is normalized to that of experiments with 1 task per core. The number of tasks per core can be 2, 4 or 8. SP requires a square number of threads. Overall workload performance varies from -2% to 2%.

Figure 5. Changes in balance on UMA (UPC, Tigerton), reported as the ratio between the lowest and highest user time across all cores compared to the 1/core setting.

Figure 6. Changes in the total number of cache misses per 1000 instructions across all cores (UPC, Tigerton), compared to 1/core. The EP miss rate is very low.

Figure 7. UMA oversubscription, MPI (Tigerton). Performance is normalized to that of experiments with 1 task per core. The number of tasks per core can be 2 or 4. Overall workload performance decreases by 10% to 18%.

Figure 8. NUMA oversubscription, OpenMP (Nehalem). Performance is normalized to that of experiments with 1 task per core. The number of tasks per core can be 2, 4 or 8. Workload performance decreases by 6% to 14%.

Figure 9. OpenMP performance with cooperative synchronization (KMP_BLOCKTIME=0) on Barcelona. DEF1/KMP1 stands for the default/KMP=0 setting with one thread per core. Values greater than 1 indicate performance improvement.

Figure 10. OpenMP performance with "non-cooperative" synchronization (KMP_BLOCKTIME=infinite) on Barcelona. DEF1/INF1 stands for the default/KMP=infinite setting with one thread per core. Values greater than 1 indicate performance improvement.

Liao et al [16] also report better performance when running OpenMP with a static number of threads, evenly distributed. Enabling dynamic adjustment of the number of threads does not affect the overall trends when oversubscribing.

The synchronization behavior of OpenMP can be adjusted using the KMP_BLOCKTIME variable. A setting of KMP_BLOCKTIME=0 forces threads to sleep immediately inside synchronization operations for a more cooperative behavior. This setting determines a slight decrease in performance when running with one thread per core, but it improves performance when oversubscribing. Figure 9 presents the performance compared to the default setting. As illustrated, the performance of the coarse grained benchmarks is not affected, while for the fine grained benchmarks we observe improvements as high as 10%. Figure 10 presents the performance with the setting KMP_BLOCKTIME=infinite, where threads are less cooperative and never sleep. This setting provides the best overall performance for the OpenMP implementations.

All of our results and analysis indicate that the best predictor of application behavior when oversubscribing is the average inter-barrier interval. Applications with barriers executed every few ms are affected, while coarser grained applications are oblivious or their performance improves.

In order to gain more insight into application behavior in the presence of oversubscription we have also collected data from hardware performance counters (cache misses, total cycles, TLB) and detailed scheduling information from the system logs (user and system time, number of thread migrations, virtual memory behavior). Some of these metrics directly capture the benefits of oversubscription and we illustrate the observed UPC behavior on Tigerton. Figure 5 presents the ratio between the lowest and the highest amount of observed user time across all cores, normalized to the ratio for the execution with one thread per core. This measure captures the variation of core utilization with oversubscription. Oversubscription improves CPU utilization for FT (all classes), IS-A and CG (A and C). Figure 6 presents the relative behavior of the L2 cache, which is shared on Tigerton. We report the normalized total number of cache misses per 1000 instructions across all cores. The results are normalized to the execution with one thread per core. Oversubscription improves memory behavior for IS-C and MG (A and B). The behavior of the remaining benchmarks could be explained by a combination of these two metrics.

The UPC performance trends capture another potential benefit of oversubscription: it decreases resource contention and the serialization of operations. The benchmarks where performance improves (FT, IS, SP) are characterized by resource contention. They all contain a hand coded communication phase where each thread transfers large amounts of data from the memory space of all (FT, IS) or many (SP) other threads. This portion of the code is written in such a manner that all transfers start from the same thread and proceed in sequence (0, 1, 2...). Oversubscription decreases the granularity of these contending transfers and allows for less serialization. In all three benchmarks, most of the performance improvements occur in these particular parts of the code; this also accounts for the better CPU utilization. Note that this behavior also explains the better improvements of UPC on Tigerton when compared to Barcelona: Tigerton has a much lower memory bandwidth and the front-side bus is a source of contention. The MPI benchmarks use the same domain decomposition as the UPC implementations, but call into tuned implementations of collective and scatter-gather communication. This explains the lower benefits of oversubscription for the MPI implementations.

5.1 Interaction With Per-Core Scheduling

Figures 3, 4, 7 and 8 also illustrate the impact of scheduling decisions inside the operating system. New Linux distributions allow two different behaviors for the sched_yield system call. The default behavior (referred to as CFS) does not suspend the calling thread if its quantum has not expired, while the POSIX conforming implementation un-schedules the caller. The former is designed to improve interactive behavior in desktop and server workloads, while the latter is the behavior assumed by the implementations of synchronization operations in scientific programming. In the figures, the bars labeled PSX show the additional performance improvements when replacing² the default Linux sched_yield implementation with the POSIX conforming implementation. The results show that POSIX yield performs better than CFS yield and usually the performance differences increase with the degree of oversubscription. The impact on a benchmark is relatively independent of the problem class and is an indication of the frequency of synchronization operations present in the benchmark.

² Behavior is controlled by writing 1 to /proc/sys/kernel/sched_compat_yield.

The UPC workload is less affected by the sched_yield implementation than the OpenMP workload. This behavior is explained by the finer granularity of parallelism in the OpenMP implementations. The performance of OpenMP runs with eight threads per core is completely dominated by the impact of more cooperative yielding. The impact on the MPI workload is small.

For all implementations, runs where threads are explicitly managed are improved less than runs subject to the default Linux load balancing. The detailed comparison of CFS-Load with CFS-PIN is omitted for brevity. For example, the average improvements in the UPC workload performance are 6% for pinned and 9% for load balancing, with ranges [-7%, 35%] and [-10%, 40%] respectively. As explained in the next section, this behavior is caused by the inability of the Linux load balancing to migrate threads after startup and initial memory allocation. Based on these observations, all other experimental results presented are with POSIX yield.
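A minimal sketch of how an experiment can switch to the POSIX conforming yield behavior (footnote 2) before launching a benchmark; the helper name is illustrative, and the knob requires root privileges on one of the 2.6.2x/2.6.3x kernels used in this study:

```c
#include <stdio.h>

/* Select the kernel's sched_yield() semantics: writing 1 selects the
 * POSIX-conforming behavior that un-schedules the caller, writing 0
 * restores the CFS default described above. */
int set_posix_yield(int enable)
{
    FILE *f = fopen("/proc/sys/kernel/sched_compat_yield", "w");
    if (!f) {
        perror("sched_compat_yield");
        return -1;
    }
    fprintf(f, "%d\n", enable ? 1 : 0);
    return fclose(f);
}
```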

5.2 Thread Affinity Management

Figures 3, 4, 7 and 8 also present the impact of thread affinity management on application performance. The bars labeled PIN show the average performance improvements when threads are evenly distributed across the available cores compared to the default Linux load balancing. As expected, the impact of affinity management is higher for the NUMA architectures, as illustrated for UPC by Figures 3 and 4. UPC and OpenMP are sensitive to thread affinity management regardless of the degree of oversubscription. For example, on Barcelona UPC runtime performance improves by as much as 57%, while OpenMP performance improves by 31%. MPI performance is less affected by affinity management in the presence of oversubscription. Explicit thread affinity management also increases performance reproducibility: runs with the default Linux load balancing exhibit a variation as high as 120%, while runs with pinned threads vary by at most 10%.

The performance differences are explained by a combination of load balancing behavior, NUMA memory affinity and runtime implementation. With the default load balancing, threads are started on a few cores and later migrated. With first-touch memory affinity management, pages are bound to a memory controller before threads have a chance to migrate to an available core. OpenMP performs an implicit barrier after spawning threads; threads might sleep inside the barrier, which causes the Linux load balancing to migrate them. UPC, which is the most sensitive to affinity management, allocates the shared heap at program startup and each thread touches its memory pages. However, this first touch happens before threads have a chance to migrate to an idle core.
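A minimal sketch of the first-touch interaction described above, assuming a pthreads program that pins each thread before it initializes its chunk of a shared array (names, sizes and the one-thread-per-core mapping are illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 16
#define CHUNK    (1L << 21)     /* 2M doubles (16 MB) per thread, illustrative */

static double *shared_heap;     /* allocated once at program startup */

/* Each thread pins itself first, then touches (initializes) its own
 * chunk. With the first-touch NUMA policy the pages are then bound to
 * the memory controller of the socket the thread will actually run on.
 * If the touch happens before the thread reaches its final core (as
 * with the default load balancing described above), pages end up on
 * the wrong NUMA node. */
static void *init_chunk(void *arg)
{
    long id = (long)arg;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET((int)id, &mask);                 /* one thread per core */
    sched_setaffinity(0, sizeof(mask), &mask);

    memset(shared_heap + id * CHUNK, 0, CHUNK * sizeof(double));
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    shared_heap = malloc((size_t)NTHREADS * CHUNK * sizeof(double));
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, init_chunk, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    free(shared_heap);
    return 0;
}
```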

For the benchmark implementations examined in this paper, the performance impact of thread affinity management is an artifact of the characteristics of Linux thread startup and load balancing, rather than of the application itself. Ensuring that threads are directly started on the idle cores eliminates most of the effects of explicit affinity management. Our experiments with different random thread to core mappings show indistinguishable performance between pinnings. The differences between the average workload performance with any two pinnings are within a few percent (well within the variation for a given pinning), with a maximum of 12% for one particular benchmark (SP).

In [11] we present a user level load balancer that enforces an even initial thread distribution and constrains threads to a NUMA domain, rather than a particular core. Results for the same workload presented here indicate that threads can freely migrate inside a NUMA domain without experiencing performance degradation. Also note that new implementations of job spawners on large scale systems enforce an even thread distribution at program startup. We therefore expect explicit thread affinity management to play a smaller role in the code optimization of scientific applications.

6. Competitive Environments

We explore two alternatives for system management in competitive environments: 1) sharing (best effort) and 2) partitioning (managed). For the sharing experiments, each application is allowed to run on all the system cores, while for the partitioned experiments each application is given an equal number of processors. We consider only fully isolated partitions, i.e. applications do not share sockets. The application combinations we present (EP, CG, FT and MG) have been chosen to contain co-runners with relatively equal durations³ and different behavior. For lack of space we do not present a full set of experiments and concentrate mostly on UPC and OpenMP behavior. EP is a CPU intensive application while CG, MG and FT are memory intensive. FT performance improves (UPC) or is not affected by oversubscription (OpenMP), while MG performance slightly degrades with oversubscription. EP performance is not affected by oversubscription, while CG performs fine grained synchronization and its performance degrades. We do not present data for IS, which runs for at most 2s, or for SP, which requires a square number of threads in UPC.

³ On Barcelona: ep-C=21.11s, cg-B=24.426s, ft-B=9.5s and mg-C=27.64s for OpenMP.

Figure 11 presents the comparative performance of combined workloads when sharing or partitioning the system. We plot the speedup of each application in a pair (x-axis) when sharing the system compared to its performance with partitioning. In the partitioning experiment, each application is run on eight cores, with one thread per core. In the sharing experiments, each application receives 16 cores and the number of threads per core indicated on the x-axis. The results indicate that partitioning cores between applications decreases performance and system throughput. For example, a cg-B+cg-B OpenMP experiment on Barcelona observes a 20% slowdown, while in mg-C+cg-B we observe a ≈70% improvement for mg-C and a 20% slowdown for cg-B. A cg-B+cg-B UPC experiment on Barcelona observes a 10% slowdown when sharing cores, while in the mg-C+cg-B experiment mg-C observes a 90% speedup and cg-B observes a 30% speedup. Overall, when each application is executed with one thread per core, sharing the system shows a 33% and 23% improvement for the UPC and OpenMP workloads respectively.

Oversubscription increases the benefits of sharing the system for the CMP architectures: the UPC improvements are 38% and 45%, while the OpenMP improvements are 46% and 28%, when running with two and four threads per core respectively. The best overall throughput is obtained when sharing the system and allowing each application to run with multiple (2, 4) threads per core. The results on the Nehalem SMT architecture are presented in Figure 13. On this architecture, sharing the system provides identical throughput to the partitioned runs.

Figure 12 shows the impact of sharing the Barcelona system. For each benchmark, the figure plots the fraction of its performance when compared to the dedicated run with one thread per core. When sharing, we expect each benchmark to run at close to 50% of its original speed. Any fraction larger than 50% indicates a potential increase in throughput. These results also give an indication of application symbiosis. For example, in the ep-C+ep-C combination each instance runs at 50% of the original speed.

[Figures 11-19 show per-benchmark and per-combination performance bars; only the captions are reproduced here.]

Figure 11. Performance of the OMP benchmarks (Barcelona) when sharing the system compared to partitioning.

Figure 12. Percentage of the dedicated one-thread-per-core performance achieved by the OMP benchmarks (Barcelona) when sharing the whole system.

Figure 13. Performance of the OMP benchmarks (Nehalem) when sharing the system compared to partitioning.

Figure 14. Percentage of the dedicated one-thread-per-core performance achieved by the OMP benchmarks (Nehalem) when sharing the whole system.

Figure 15. Throughput for OMP/MPI sharing (Tigerton).

Figure 16. Throughput for OMP/UPC sharing (Barcelona), KMP_BLOCKTIME=0.

Figure 17. Throughput for OMP/UPC sharing (Barcelona), KMP_BLOCKTIME=infinite.

Figure 18. Relative differences in performance when the application is given fewer threads (UPC, Tigerton). The x-axis gives the number of application and co-runner threads (1/2 and 2/4).

Figure 19. Relative differences in performance when the application is given more threads (UPC, Tigerton). The x-axis gives the number of application and co-runner threads (2/1 and 4/2).

The cg-B+ep-C combination contains a memory intensive benchmark and a CPU intensive benchmark: cg-B runs at 60% of its original speed, while ep-C runs at 90% of its original speed. For combinations of memory intensive applications (mg-C+cg-B), mg-C runs at 85% of its original speed, while cg-B runs at 45%. Figure 14 shows the behavior on Nehalem, where we observe a 7% overall improvement.

The UPC programs respond better to oversubscription than OpenMP and therefore respond better to sharing the system. The overall throughput of the UPC+UPC workload is improved by 20% and 27% when applications are executed with one and two threads per core respectively. The OpenMP+OpenMP workload throughput is improved by 9% and 24% respectively. These improvements are relative to the performance when applications are run on a dedicated system with one thread per core. For reference, Figure 15 shows the throughput improvements when the system is shared between MPI and OpenMP implementations. We plot the speedup of an application combination when sharing compared to executing each application in sequence (max(T(A), T(B)) at the given concurrency compared with T(A) + T(B) when executed with one thread per core). Any combination of programming models exhibits similar trends on all architectures. Figures 16 and 17 show the behavior of a UPC+OpenMP workload for different settings of KMP_BLOCKTIME. All results indicate that the best throughput is obtained when sharing, with each application running with multiple threads per core.

Figures 11 and 12 present results for experiments (PIN) where thread affinity is explicitly managed. With the default Linux load balancing, results are very noisy and throughput is lowered when sharing cores. In particular, when using the default Linux load balancing the total execution time in a shared environment is greater than the time of executing the benchmarks in sequence. Runs with explicitly pinned threads not only outperform the runs with the default Linux load balancing, but also exhibit very low performance variability, less than 10%, compared to the high variability of the latter, which is up to 100%.

For any given benchmark combination, sharing or partitioning, the performance variability of the total duration of the run (A‖B) is small, usually less than 10%. Partitioning provides performance reproducibility: for every benchmark pair, the performance variation of any benchmark in the pair is less than 10%, irrespective of co-runners. When sharing the system, the variability of each benchmark across individual runs is higher, up to 81% across all combinations.

There is a direct correlation between the oversubscription trends presented in Section 5 for a dedicated environment and application symbiosis when sharing the system. The applications with coarse grained synchronization share the system very well, while the applications with fine grained synchronization might observe some performance degradation. For any given pair of benchmarks, if one benchmark in the pair is not affected by oversubscription, the overall throughput improves irrespective of its co-runner behavior and the number of threads per core in each application: there is a direct correlation between behavior with oversubscription and behavior when sharing. This also indicates that synchronization granularity is perhaps the most important symbiosis metric, and it suggests that oversubscription could potentially diminish any advantages of symbiotic scheduling of parallel applications. Increasing the number of threads per core in each application improves overall throughput. Intuitively, oversubscription increases diversity in the system and decreases the potential for resource conflicts.

Oversubscription also changes the relative ordering of the performance of the implementations. In a dedicated environment, the NAS OpenMP implementations have a performance advantage over the UPC and MPI implementations (≈ 10%-30%). Sharing reverses the relative performance trends observed in dedicated environments, and the shared UPC workloads provide the shortest time to solution (≈ 10% faster).


6.1 Imbalanced Sharing

The per-core scheduling mechanism (CFS) in Linux attempts to provide a fair repartition of execution time to the tasks sharing the core, and oversubscription might provide a mechanism to proportionally allocate system resources to applications.

Figures 18 and 19 present results for sharing the system when one application is given preference and is allowed to run with a larger number of threads. Figure 18 presents the impact on the application that receives the smaller number of threads, while Figure 19 presents the impact on the application with the larger number of threads. Both figures present the performance normalized to the performance observed when both applications run concurrently with one thread per core. CG performance degrades with oversubscription in dedicated environments and, for these experiments, allocating more threads to CG than to its co-runners does not improve its performance. EP is compute bound and allocating more threads determines a throughput increase proportional to the thread ratio with respect to co-runners. FT, which benefits from oversubscription, observes good throughput increases when given preference. MG also observes throughput increases, albeit smaller than FT.

The results show a strong correlation between application behavior with oversubscription in dedicated environments and the observed results in these scenarios. Overall, the performance of the applications that receive the smaller number of threads does not degrade, and for the considered benchmarks we observe little throughput change (1% and -7% when receiving a 33% and 20% per-core share respectively) when compared to balanced sharing. The performance of applications that receive a larger number of threads improves and we observe an overall improvement in throughput. The improvement in throughput is correlated to the task share received by the application, e.g. we observe a 10% overall throughput improvement for two threads and 8% for four threads. These results are heavily skewed by the behavior of CG. Without CG, the improvements are 12% and 20% respectively. Our experiments indicate that imbalanced sharing should not be considered for OpenMP. We plan to examine in future work the impact of gang scheduling on OpenMP performance in this scenario.

These trends indicate that, besides priorities and partitioning, controlling the number of threads an application receives is worth exploring as a way of controlling its system share. The magnitude of the performance differences indicates that for modest sharing (two or three applications) and modest oversubscription (two, four or eight threads) gang scheduling techniques might not provide large additional performance improvements for the SPMD programming models (UPC and MPI) evaluated.

7. Discussion

All implementations examined in this study have even per-thread domain decompositions and are well balanced. We expect the benefits of oversubscription to be even more pronounced for irregular applications that suffer from load imbalance.

There are several implications from our evaluation of sharing and partitioning. Partitioning gives each application a share of dedicated resources at the expense of lower parallelism. Sharing time-slices resources across applications. Performance is determined by a combination of load balancing, degree of parallelism and contention with respect to resource usage: CPU, caches and memory bandwidth. The fact that partitioning produces lower performance than sharing indicates that for our workloads parallelism and load balance concerns still trump contention concerns.

We examine reference implementations compiled with commercially available compilers. Better optimizations or applying autotuning techniques to these applications might produce code (better cache locality, increased memory bandwidth requirements) that requires contention awareness and system partitioning. However, current autotuning techniques improve short term locality and reuse, and the granularity of the OS time quantum is likely to remain larger than the duration of the computation blocks affected by autotuning. We believe that better code generation techniques will not affect the trends reported in this study for the current system sizes: parallelism and load balancing concerns will continue to be the determining performance factors. Determining the system size at which contention and scheduling interactions require careful partitioning is an open research question.

While we clearly advocate running in competitive environments and sharing cores for improved throughput, the question of the proper environment for code tuning remains open. The results with system partitioning indicate good performance isolation between competing applications, and partitioning seems to be the preferred alternative for tuning.

We believe that further performance improvements can be achieved for the fine grained applications by re-examining the implementations of the collective and barrier synchronization operations. The current implementations are heavily optimized for execution with one thread per core in dedicated environments. In [5] we present kernel level extensions for cooperative scheduling on the Cell BE in the presence of oversubscription. In that particular environment, oversubscription was required for good performance and we extended Linux with a new sched_yield_to system call. Similar support and a re-thinking of the barrier implementations might be able to improve the performance of oversubscribed fine grained applications. As future work we also plan to examine the interaction between oversubscription and networking behavior on large scale clusters.

The reversal of performance trends between UPC and OpenMP in the presence of competition and oversubscription indicates that these factors might be valuable when evaluating parallel programming models in competitive desktop and shared server environments.

8. Related Work

Charm [1] and AMPI [12] are research projects that advocate oversubscription as an efficient way of hiding communication latency and improving application level load balancing for MPI programs running on clusters. They use thread virtualization and provide a runtime scheduler to multiplex tasks waiting for communication. Cilk and X10 provide work stealing runtimes for load balancing. All these approaches provide their own implementation specific load balancing mechanism and assume execution in dedicated environments with one "meta-thread" per core. The behavior of these models in competitive environments has not been studied. We believe that oversubscription can provide an orthogonal mechanism to increase performance robustness in competitive environments.

Oversubscription for OpenMP programs is discussed by Curtis-Maury et al [19] for SMT and CMP (simulated) processors. For the NAS benchmarks they report a much higher impact of oversubscription than the impact we observe in this study on existing architectures, and they mostly advocate against it. For SMT, they also indicate that symbiosis with the hardware context co-runner is very important. Our results using a similar workload on Nehalem processors indicate that the determining performance factor is the granularity of the synchronization operations. The differences in reported trends might be caused by the fact that we evaluate systems with a larger number of cores. Liao et al [16] discuss OpenMP performance on CMP and SMT architectures running in dedicated mode and consider both static and dynamic schedules. Their results indicate little difference in performance when using different schedules.

A large body of research in multiprocessor scheduling can be loosely classified as symbiotic scheduling: threads are scheduled according to resource usage patterns. Surprisingly, there is little information available about the impact of sharing or partitioning multicore processors for fully parallel workloads. Most available studies consider either symbiotic scheduling of parallel workloads on clusters (e.g. [24], computation + I/O) or symbiotic scheduling of multiprogrammed workloads on multicore systems. Snavely and Tullsen [22] introduce symbiotic co-scheduling on SMT processors for a workload heavily biased towards multiprogrammed single-threaded jobs. Their approach samples different possible schedules and assigns threads to SMT hardware contexts. Fedorova et al [8] present an OS scheduler able to improve the cache symbiosis of multiprogrammed workloads. Our results indicate that for parallel scientific workloads in competitive environments oversubscription is a robust way to increase symbiosis without specialized support.

Boneti et al [6] present a scheduler for balancing high performance computing MPI applications on the POWER5 processor. Their implementation targets SMT processors and uses hardware support to manipulate instruction priorities within one core. Li et al [15] present an implementation of Linux load balancing for performance asymmetric multicore architectures and discuss performance under parallel and server workloads. They modify the Linux load balancer to use the notions of scaled core speed and scaled load, but balancing is performed based on run queue length.

Job scheduling for parallel systems has an active research community. Feitelson [2] maintains the Parallel Workload Archive, which contains standardized job traces from many large scale installations. Feitelson and Rudolph [9] provide an insightful discussion of the impact of gang scheduling on the performance of fine grained applications. Their study was conducted in 1992 on the Makbilan processor. Zhang et al [27] provide a comparison of existing techniques and discuss migration based gang scheduling for large scale systems. Gang scheduling is increasingly mentioned in multicore research studies, e.g. [17], but without a practical, working implementation. Gang scheduling has also been shown to reduce OS jitter on large scale systems. The new operating systems for the IBM BG/Q and Cray XT6 systems have been announced to support dedicated OS service cores in an attempt to minimize jitter at very large scale. Mann and Mittaly present a Linux kernel implementation able to reduce jitter by reducing the number of kernel threads, intelligent interrupt handling and manipulating hardware SMT thread priorities. Our results provide encouraging evidence that oversubscription might provide an alternate way of reducing the impact of OS jitter on the performance of scientific applications and alleviating some of the need for gang scheduling.

9. Conclusion

In this paper we evaluate the impact of executing MPI, UPC and OpenMP applications with task oversubscription. We use implementations of the NAS Parallel Benchmarks on multi-socket multicore systems using both UMA and NUMA memory systems. In addition to the default Linux load balancing we evaluate the impact of explicit thread affinity management. We also evaluate the impact of sharing or partitioning the cores within a system on the throughput of parallel workloads.
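
To make the notion of explicit thread affinity management concrete, the following is a minimal Linux sketch (illustrative only, not the runtime support evaluated in this study) in which each thread is pinned either to a single dedicated core (a partitioned placement) or allowed to run on every core (a shared placement); the thread count and partition layout are assumptions for the example.

    /* affinity.c (illustrative): gcc affinity.c -o affinity -lpthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 4            /* hypothetical thread count */
    static int partitioned = 1;   /* 1: one core per thread, 0: share all cores */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t set;
        CPU_ZERO(&set);
        if (partitioned) {
            CPU_SET(id % ncores, &set);        /* dedicate one core to this thread */
        } else {
            for (long c = 0; c < ncores; c++)  /* allow migration across all cores */
                CPU_SET(c, &set);
        }
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        /* ... application work would execute here under the chosen placement ... */
        printf("thread %ld bound (%s)\n", id, partitioned ? "partitioned" : "shared");
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

The same placements can usually be obtained without code changes, for example with taskset or the affinity controls exposed by MPI and OpenMP runtimes.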

Our results indicate that oversubscription with proper support should be given real consideration when running parallel applications. For competitive environments, oversubscription decreases the impact of co-runners and the performance variability. In these environments, oversubscription also improves system throughput by up to 27%. For the systems evaluated, partitioning in competitive environments reduces throughput by up to 46%. On all architectures evaluated, best throughput is obtained when applications share all the cores and are executed with multiple threads per core. The granularity of the synchronization present in an application in a dedicated environment is perhaps the best measure of its degree of symbiosis with other applications.

References

[1] CHARM++ project web page. Available at http://charm.cs.uiuc.edu.
[2] Parallel Workload Archive. Available at http://www.cs.huji.ac.il/labs/parallel/workload/.
[3] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[4] P. Balaji, R. Thakur, E. Lusk, and J. Dinan. Hybrid Parallel Programming with MPI and PGAS (UPC). Available at http://meetings.mpi-forum.org/secretary/2009/07/slides/2009-07-27-mpi-upc.pdf.
[5] F. Blagojevic, C. Iancu, K. Yelick, M. Curtis-Maury, D. S. Nikolopoulos, and B. Rose. Scheduling Dynamic Parallelism on Accelerators. In CF '09: Proceedings of the 6th ACM Conference on Computing Frontiers, 2009.
[6] C. Boneti, R. Gioiosa, F. J. Cazorla, and M. Valero. A Dynamic Scheduler for Balancing HPC Applications. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[7] F. Cappello and D. Etiemble. MPI Versus MPI+OpenMP on IBM SP for the NAS Benchmarks. In Supercomputing '00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, 2000.
[8] A. Fedorova, M. I. Seltzer, and M. D. Smith. Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler. In Proceedings of Parallel Architectures and Compilation Techniques (PACT), 2007.
[9] D. G. Feitelson and L. Rudolph. Gang Scheduling Performance Benefits for Fine-Grain Synchronization. Journal of Parallel and Distributed Computing, 16:306–318, 1992.
[10] The GWU NAS Benchmarks. Available at http://upc.gwu.edu/download.html.
[11] S. Hofmeyr, C. Iancu, and F. Blagojevic. Load Balancing on Speed. To appear in Proceedings of Principles and Practice of Parallel Programming (PPoPP '10), 2010.
[12] C. Huang, O. Lawlor, and L. V. Kalé. Adaptive MPI. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC '03), pages 306–322, 2003.
[13] G. Krawezik. Performance Comparison of MPI and Three OpenMP Programming Styles on Shared Memory Multiprocessors. In SPAA '03: Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, 2003.
[14] T. Li, D. Baumberger, and S. Hahn. Efficient and Scalable Multiprocessor Fair Scheduling Using Distributed Weighted Round-Robin. In PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 65–74, New York, NY, USA, 2009. ACM.
[15] T. Li, D. Baumberger, D. A. Koufaty, and S. Hahn. Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007.
[16] C. Liao, Z. Liu, L. Huang, and B. Chapman. Evaluating OpenMP on Chip MultiThreading Platforms. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2008.
[17] R. Liu, K. Klues, S. Bird, S. Hofmeyr, K. Asanovic, and J. Kubiatowicz. Tessellation: Space-Time Partitioning in a Manycore Client OS. In Proceedings of the First USENIX Workshop on Hot Topics in Parallelism, 2009.
[18] A. Mandal, A. Porterfield, R. J. Fowler, and M. Y. Lim. Performance Consistency on Multi-Socket AMD Opteron Systems. Technical Report TR-08-07, RENCI, 2008.
[19] M. Curtis-Maury, X. Ding, C. Antonopoulos, and D. Nikolopoulos. An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors. In Proceedings of the 1st International Workshop on OpenMP (IWOMP '05).
[20] The NAS Parallel Benchmarks. Available at http://www.nas.nasa.gov/Software/NPB.
[21] K. J. Nesbit, M. Moreto, F. J. Cazorla, A. Ramirez, M. Valero, and J. E. Smith. Multicore Resource Management. IEEE Micro, 28(3), 2008.
[22] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Processor. In Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 234–244, 2000.
[23] UPC Language Specification, Version 1.0. Available at http://upc.gwu.edu.
[24] J. Weinberg and A. Snavely. User-Guided Symbiotic Space-Sharing of Real Workloads. In ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, 2006.
[25] S. B. Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: An Operating System for Many Cores. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI '08), 2008.
[26] L. Xue, O. Ozturk, F. Li, M. Kandemir, and I. Kolcu. Dynamic Partitioning of Processing and Memory Resources in Embedded MPSoC Architectures. In DATE '06: Proceedings of the Conference on Design, Automation and Test in Europe, pages 690–695, Leuven, Belgium, 2006. European Design and Automation Association.
[27] Y. Zhang, H. Franke, J. Moreira, and A. Sivasubramaniam. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2003.