
Exploring Multi-Threaded Java Application Performance on Multicore Hardware

Jennifer B. Sartor and Lieven Eeckhout
Ghent University, Belgium

Abstract

While there have been many studies of how to schedule applications to take advantage of increasing numbers of cores in modern-day multicore processors, few have focused on multi-threaded managed language applications, which are prevalent from the embedded to the server domain. Managed languages complicate performance studies because they have additional virtual machine threads that collect garbage and dynamically compile, closely interacting with application threads. Further complexity is introduced as modern multicore machines have multiple sockets and dynamic frequency scaling options, broadening opportunities to reduce both power and running time.

In this paper, we explore the performance of Java applications, studying how best to map application and virtual machine (JVM) threads to a multicore, multi-socket environment. We explore both the cost of separating JVM threads from application threads, and the opportunity to speed up or slow down the clock frequency of isolated threads. We perform experiments with the multi-threaded DaCapo benchmarks and pseudojbb2005 running on the Jikes Research Virtual Machine, on a dual-socket, 8-core Intel Nehalem machine, to reveal several novel, and sometimes counter-intuitive, findings. We believe these insights are a first but important step towards understanding and optimizing managed language performance on modern hardware.

Categories and Subject Descriptors  D.3.4 [Programming Languages]: Processors—Memory management (garbage collection); Optimization; Run-time environments

General Terms  Performance, Measurement

Keywords  Performance Analysis, Multicore, Managed Languages, Java

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
OOPSLA'12, October 19-26, 2012, Tucson, Arizona, USA.
Copyright © 2012 ACM 978-1-4503-1561-6/12/10...$10.00

1. Introduction

Multicore systems are here to stay. Since the first multicore processor was released over a decade ago, today's processors implement up to a dozen cores, and each generation is expected to have an increasing number of cores because of Moore's law [19]. Multicore processors are abundant across all segments of the computer business, from hand-held devices such as smartphones and tablets, up to high-end systems. Today's servers even host multiple sockets, effectively creating systems with multiple tens of cores.

Given the ubiquity of multi-threaded managed language applications [22], now running on modern multicore hardware, their performance is critical to understand. While multicore performance has been extensively studied for multi-threaded applications written in traditional programming languages, understanding the performance of applications written in managed languages such as Java has received little attention. Part of the reason is the complexity of these managed workloads. Java virtual machine (JVM) service threads, such as garbage collection, profiling, and compilation threads, interact with application threads in many complex, non-deterministic ways. Not only can these threads stop the application, e.g., in order to collect a full heap, they also interact with the application at the microarchitectural level, sharing hardware resources such as cores, cache, and off-chip bandwidth.

Looking forward, optimizing performance while taking into account power and energy will be even more important. Handheld devices have to minimize total energy consumption under given performance goals, e.g., soft real-time deadlines. For high-end servers, the goal is typically to optimize performance while not exceeding a given power budget. Several studies point out that with the end of Dennard scaling [7], i.e., slowed supply voltage scaling, it will no longer be possible to power on the entire processor all the time, a problem referred to as dark silicon [9]. Intel's Turbo Boost technology [15], which enables increasing the clock frequency for a limited number of cores and for a short duration of time, could be viewed as a form of dark silicon.

Understanding and optimizing managed application performance on multicore processors and systems is therefore a complicated endeavor. Some of the fundamental questions that arise are: How many application and JVM service threads yield optimum performance? Should one initiate as many application and garbage collection threads as there are cores on the chip? If so, does this hold true for multi-socket systems as well? Further, how should one distribute the application, garbage collection and compiler threads across the various cores and sockets in the system? In particular, in the case of a multi-socket system, should one schedule JVM service threads on the same socket as the application threads, and what is the performance penalty, if any, from offloading JVM service threads to another socket? If optimizing performance, but taking power into consideration, should one speed up all threads or just the application threads? And how much does performance degrade by slowing down JVM service threads?

In this paper, we provide insight into Java multi-threaded application performance on multicore, multi-socket hardware. We perform a comprehensive set of experiments using a collection of multi-threaded DaCapo benchmarks and pseudojbb2005 on a dual-socket 8-core Intel Nehalem system with a current version of the Jikes Research Virtual Machine. We vary the number of cores and sockets, the number of application and garbage collection threads, clock frequency, thread-to-core mapping and pinning, and heap size.

We are the first to extensively explore the space of multi-threaded Java applications on multicore multi-socket systems, and we reveal a number of new, and sometimes surprising, conclusions:

1. Java application running time benefits significantly from increasing core frequency.

2. When isolating JVM collector threads onto a separate socket from application threads, there is a cost, but it is less than 20% of performance. However, isolating the compilation thread is usually performance-neutral, and for some benchmarks actually improves overall performance.

3. If power-constrained, lowering the frequency of JVM service threads costs a fraction of performance, 3 to 5 times less than when scaling application threads.

4. Many benchmarks achieve good performance when all threads run on a single socket at the highest frequency, and the second socket is kept idle. However, all but one benchmark have optimal performance when isolating the compilation thread and lowering its frequency.

5. For our benchmarks, the best performance running on one socket is usually obtained by keeping the number of application threads equal to the number of collection threads, both equal to the number of cores. With two sockets, it is better to pair application and collector threads together (unless there is a lot of inter-thread communication), and almost all benchmarks benefit from increasing the number of application threads to be equal to the number of cores while setting the number of collection threads to half of that.

6. Pinning application and collector threads to cores does not always improve performance, and some benchmarks benefit from letting threads migrate.

7. During startup time, the cost of isolating JVM service threads to another socket is less than at steady-state time, while lowering their frequency still deteriorates performance several times less than lowering that of the application threads.

Analyzing Java applications on modern multicore, multi-socket hardware reveals that it is difficult to follow a set of rules that will lead to optimal performance for all applications. However, our results reveal insights that will assist in identifying the right number and scheduling of application and JVM service threads that will preserve performance while saving the critical resource of power in future systems.

2. Motivation

As managed languages have become ubiquitous from the embedded to the server market and in between, it is important to analyze their performance on multicore hardware. Because Java applications run on top of a virtual runtime environment, which includes its own management threads and introduces non-determinism for applications, choosing the correct parameters and configurations to balance performance and energy is complex. Modern machines also introduce extra runtime parameters to study, offering multiple sockets and frequency (and voltage) scaling options. Researchers have been studying the performance and power consumption of managed languages in prior work [4-6, 10, 11, 13], demonstrating that it is an important problem. However, studies in prior work were limited to single-socket systems, single-threaded Java applications, and/or isolated JVM service performance, not end-to-end performance. (See Section 5 for a more detailed description of prior work.)

In this work we expand the exploration space considerably. We consider multi-threaded applications written in a managed programming language, Java. We focus on multicore, multi-socket hardware. We also aim at understanding how design choices in terms of thread mapping, frequency scaling, thread pairing, pinning, and more affect end-to-end application performance. More specifically, there are a number of fundamental questions related to Java application performance on multicore multi-socket systems that merit further investigation.

Number of application and JVM service threads.  First, we would like to explore the number of application and collection threads given a particular number of cores. Default methodology sets the number of application and collection threads equal to the number of cores; however, this does not necessarily take into account the sharing of hardware resources between these and other JVM threads, nor does it take into account memory access ramifications of communicating across sockets. Although current practice is to have thread-local allocation, collection threads are not tied to any particular application thread or core (in our Java virtual machine), and can touch many areas of memory and incur inter-thread synchronization in order to reclaim space. Other JVM threads, such as those that perform dynamic compilation and on-stack replacement, also interact with application threads and share hardware resources. It is unclear whether hardware resource sharing between JVM and application threads actually helps or hurts performance, because of either better data locality or resource contention.

Thread-to-core/socket mapping.  Inherent in the choice of the number of threads is where to place these threads on a multicore multi-socket system. Current systems largely leave thread scheduling up to the operating system, which can preempt and context switch threads when necessary. Intuition suggests we should use all cores and maximize parallelism. Further, benchmarks with large working sets could benefit from the aggregate last-level cache capacity across sockets. However, as previously mentioned, moving data between sockets increases inter-thread communication, leading to more memory accesses and coherence traffic and potentially higher synchronization costs.

Frequency scaling and power implications.  Because modern machines include the ability to dynamically scale the core frequency, another axis to explore is the ramifications of scaling on total application performance. Because power is a first-order concern, and will become even more constrained in future systems, it might be worth paying the price of somewhat reduced performance for power savings. Hence, we need to analyze both the cost of moving threads to another core or socket and, additionally, the cost of lowering the clock speed. Because JVM service threads do not run constantly, we surmise that they should be amenable to running at scaled-down frequencies without hurting application performance drastically. However, both compilation and collection can be on the application critical path if they need to stop the workload to perform on-stack replacement or to collect garbage when the heap is full. Garbage collection threads have to trace all live heap pointers to identify dead data to reclaim, and could thus suffer from higher memory communication if moved to a separate socket.

Analyzing this challenging experimental space will shed light on the ramifications of the many configuration and scheduling decisions on overall goals, whether time or power. We explore the performance of Java workloads, while keeping power in mind, along these many orthogonal axes on modern multicore, multi-socket hardware to provide new, sometimes surprising, findings and insights.

3. Experimental Setup

Before presenting the key results obtained from this study, we describe our experimental setup of running Java workloads on multicore, multi-socket hardware.

[Figure 1 shows a block diagram of the machine: two sockets (Socket 0 and Socket 1) connected by a QuickPath interconnect, each holding four Nehalem cores with per-core 32 KB L1 and 256 KB L2 caches, a shared 8 MB L3 cache, and DDR3 memory controllers.]

Figure 1. Diagram of our two-socket, eight-core Intel Nehalem experimental machine, with memory hierarchy.

3.1 Hardware Platform

The hardware platform considered in this study is an Intel Nehalem based system, more specifically, an IBM x3650 M2 with two Intel Xeon X5570 processors that are 45-nm Westmere chips with 4 cores each, see Figure 1. The two sockets are connected to each other through QuickPath Interconnect Technology, and feature a 1333 MHz front-side bus. Each core has a private L1 with 32 KB for data and 32 KB for instructions. The unified 256 KB L2 cache is private, and the 8 MB L3 cache is shared across all four cores on each socket. The machine has a total main memory capacity of 14 GB. We can use dynamic frequency scaling (DFS) on the Nehalem to vary core frequencies between 1.596 GHz and 3.059 GHz. However, on this machine, it is only possible to change the frequency at the socket level. Therefore, all four cores on each socket run at the same frequency, and we are unable to evaluate more than two different frequencies simultaneously. For all of our experiments, we set the frequency only to the lowest and highest extremes to test the limits of performance. We turn hyperthreading off in all of our experiments.
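As a rough illustration of how socket-level frequency scaling can be driven on such a Linux machine, the hedged sketch below writes the Linux cpufreq sysfs files for every core of one socket. It is not our tooling: it assumes root privileges, the availability of the userspace cpufreq governor, and that cores 0-3 belong to socket 0 and cores 4-7 to socket 1 (the actual mapping should be checked in /sys/devices/system/cpu/cpuN/topology/physical_package_id).

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SocketFrequency {
    private static final int CORES_PER_SOCKET = 4;

    // Set every core of one socket to the same target frequency (in kHz), mirroring
    // the socket-level granularity of DFS on this machine.
    static void setSocketFrequencyKHz(int socket, long freqKHz) throws IOException {
        for (int core = socket * CORES_PER_SOCKET; core < (socket + 1) * CORES_PER_SOCKET; core++) {
            Path cpufreq = Paths.get("/sys/devices/system/cpu/cpu" + core + "/cpufreq");
            Files.write(cpufreq.resolve("scaling_governor"), "userspace".getBytes());
            Files.write(cpufreq.resolve("scaling_setspeed"), Long.toString(freqKHz).getBytes());
        }
    }

    public static void main(String[] args) throws IOException {
        setSocketFrequencyKHz(0, 3059000L); // highest frequency extreme (3.059 GHz)
        setSocketFrequencyKHz(1, 1596000L); // lowest frequency extreme (1.596 GHz)
    }
}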

3.2 Benchmarks and JVM Methodology

We perform experiments with the Jikes Research Virtual Machine (RVM), having obtained the source from the Mercurial repository in December 2011¹. We updated from a stable release of Jikes because of a revamping of their experimental methodology code. We modified Jikes slightly to identify and control JVM service thread placement for the purpose of frequency scaling. By default, we pin application and garbage collection threads to a particular core, and other JVM service threads are placed on the socket with the application threads, but not pinned to a specific core. We also perform experiments without pinning application and JVM threads for comparison, and these results will be discussed in Section 4.5.
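The per-thread pinning described above is implemented inside our modified Jikes RVM. Purely as a hedged, process-level approximation of restricting execution to a single socket, the sketch below wraps a JVM invocation in the Linux taskset utility; the launcher, jar name, and harness arguments are illustrative placeholders rather than the exact command used in this study, and cores 0-3 are again assumed to form socket 0.

import java.io.IOException;

public class PinnedLaunch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // taskset -c 0-3 confines the child process, and hence all of its threads,
        // to the four cores of one socket; per-thread placement, as described above,
        // requires support inside the virtual machine itself.
        ProcessBuilder pb = new ProcessBuilder(
                "taskset", "-c", "0-3",
                "java", "-jar", "dacapo-9.12-bach.jar", "-t", "4", "lusearch");
        pb.inheritIO();
        int exitCode = pb.start().waitFor();
        System.out.println("benchmark exited with status " + exitCode);
    }
}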

¹ changeset 10414: 5c59ac91ff06


Benchmark      Suite          Heap Size (min) MB   Total Allocation MB
avrora         DaCapo Bach    50                   54
lusearch       DaCapo Bach    34                   8152
lusearch-fix   DaCapo Bach    34                   1071
pmd            DaCapo Bach    49                   385
sunflow        DaCapo Bach    54                   1832
xalan          DaCapo Bach    54                   1104
pjbb2005       SPECjbb2005    200                  1930

Table 1. Benchmark characteristics.

In addition to the threads that perform garbage collection and compilation, other JVM service threads include a thread to perform finalization and do timing, many threads to coordinate on-stack replacement (called organizer threads) of dynamically recompiled method code, and a main thread that is the first to execute the application's main method. Jikes does not perform compilation in parallel, so there is only one compilation thread.

We perform experiments on the multi-threaded Java applications in the DaCapo benchmark suite [4] version 9.12-bach that successfully run on our version of Jikes RVM. These include avrora, lusearch, pmd, sunflow, and xalan. We also perform experiments on pseudojbb2005 [3, 4], a variant of SPECjbb2005 with a fixed number of warehouses that allows for measuring the execution time. These benchmarks are real-world open-source applications. Table 1 details which suite our benchmarks came from, their minimum heap size in MB, and their total allocation in MB [25]. We control the number of application and collection threads with command-line flags (which are by default both set to four). Our experiments use the best-performing, default generational Immix heap configuration [2]. Although there are garbage collection threads running in parallel, the collector is "stop-the-world", meaning that the application is stopped during both nursery and whole-heap collections. We performed all experiments at 1.5, 2, and 3 times the minimum heap size for each benchmark. For conciseness, we present some graphs only at 2 times the minimum heap size. We perform 10 invocations of each benchmark, and results are presented as the average of these 10 runs, along with 90% confidence intervals. For steady-state execution, which is the default reported throughout the results, we measure the 15th iteration. Although there is non-determinism inherent in using the just-in-time adaptive compilation system, which detects hot methods and dynamically re-compiles them to higher optimization levels, measuring the 15th iteration gives time for the benchmark to reach more steady-state behavior. For comparison, we present two separate sections of results at startup time, which are gathered during the first iteration of each benchmark. The compilation and on-stack-replacement threads are most active during startup time, and thus we include these results for complete evaluation.
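The steady-state timing methodology above can be summarized by the following minimal sketch (not our actual harness): each invocation warms up for 14 iterations, times only the 15th, and the 10 per-invocation times are reported as a mean with a 90% confidence interval. In the real setup each invocation is a fresh JVM process; benchmarkIteration here is a hypothetical stand-in for one benchmark iteration.

import java.util.ArrayList;
import java.util.List;

public class SteadyStateTimer {
    static final int WARMUP_ITERATIONS = 14; // iterations 1-14 warm up the adaptive compiler
    static final int INVOCATIONS = 10;

    // One invocation: run the warm-up iterations, then time the 15th iteration only.
    static long timedInvocationNs(Runnable benchmarkIteration) {
        for (int i = 0; i < WARMUP_ITERATIONS; i++) benchmarkIteration.run();
        long start = System.nanoTime();
        benchmarkIteration.run();
        return System.nanoTime() - start;
    }

    // Report the mean and a 90% confidence interval over the measured invocations.
    static void report(List<Long> timesNs) {
        double mean = timesNs.stream().mapToLong(Long::longValue).average().orElse(0);
        double variance = timesNs.stream()
                .mapToDouble(t -> (t - mean) * (t - mean)).sum() / (timesNs.size() - 1);
        double ci = 1.833 * Math.sqrt(variance / timesNs.size()); // t-value for 9 degrees of freedom
        System.out.printf("mean %.1f ms, 90%% CI +/- %.1f ms%n", mean / 1e6, ci / 1e6);
    }

    public static void main(String[] args) {
        Runnable benchmarkIteration = () -> { /* stand-in for one benchmark iteration */ };
        List<Long> times = new ArrayList<>();
        for (int inv = 0; inv < INVOCATIONS; inv++) times.add(timedInvocationNs(benchmarkIteration));
        report(times);
    }
}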

For targeted further analysis, we perform experiments with a version of the lusearch benchmark updated with a bug fix. The original lusearch derives from the 2.4.1 stable release of Apache Lucene, and has a very high allocation rate. This version included a bug that, when fixed, only conditionally allocates a large data structure, cutting allocation by a factor of eight [25]. For additional experiments, we use a version updated with revision r803664 of Lucene that contains the bug fix [20], and we call this "lusearch-fix". We include results in all graphs for "lusearch-fix"; however, we still include the original lusearch because it is the official benchmark in the current release of the DaCapo suite.

3.3 Methodological Design

Recent hardware trends point to a need for power savings and capping while delivering high levels of performance. Already, recent research [6] has explored the effect of scaling the frequency of service threads on power and energy, explicitly targeting a heterogeneous multicore processor with big and small cores. While this is an important step to inform future hardware design, it is also important to focus on evaluating how to obtain the best performance from current multi-socket systems. Therefore, we measure end-to-end performance for multi-threaded workloads, and study the effect of both isolating and scaling JVM and application threads.

Because we are unable to perform per-core DFS in our hardware setup, we employ a two-step process. We first migrate certain threads to a separate socket and evaluate the cost of isolating threads on running time. Subsequently, we quantify and report relative performance differences when comparing frequency settings after isolating threads. By doing so, we separate the effect of scaling frequency from that of isolating threads onto another socket, and hence provide insight into how scaling affects end-to-end performance in multicore systems.

The percentage of time spent in JVM threads is less than in the application threads. For this reason, we expect scaling down the frequency of JVM threads to have a smaller impact on end-to-end performance than scaling application threads. However, there is close interaction between application and JVM threads, which makes this an interesting space to explore. In particular, because we use a stop-the-world collector, application threads are stopped during garbage collection, hence slowing down collection by even a small fraction may have a significant impact on overall application performance. Likewise, just-in-time compilation threads generate optimized code during runtime, and slowing these down may cause the application to run unoptimized code for a longer period of time, which may also have a significant impact on overall performance. Furthermore, the interaction between JVM and application threads, through updating code and managing data, leads to unpredictable interference in the memory subsystem. The interactions between application and JVM threads are diverse and complex, and hence, we perform detailed experiments to evaluate the impact of both pairing and scaling the frequency of application and JVM threads.

4. Results

We now present our experimental results running multi-threaded Java applications on modern multicore, multi-socket hardware, in which we vary the number of cores and sockets, the number of application and garbage collection threads, clock frequency, thread-to-core mapping and pinning, and heap size. Because of the complexity of the space, we break up the analysis into a number of comprehensive steps. We first discuss the general effect frequency scaling has on application execution time. We then analyze the cost of isolating some JVM threads to a separate socket. After isolating JVM threads, we analyze the cost, on end-to-end performance, of scaling the frequency of isolated JVM service threads from the highest to the lowest value. Because we see a significant cost from isolating garbage collection threads, we then perform experiments that pair application and collection threads, but put some pairs on each socket. Finally, we analyze the effect that thread pinning has on performance, and the effect of varying both the number of application and collector threads while running on one socket.

4.1 The Effect of Scaling Frequency

Although a detailed power study is beyond the scope of this paper, we first wanted to explore the effect core frequency has on application performance. Figure 2 presents the speedup, as a percentage of execution time, from boosting core frequency from the lowest to the highest setting, when all threads, including four collector threads, run on only one socket. Results are for all six benchmarks for our range of three heap sizes.

Finding 1. Java workloads benefit significantly from scaling up the clock frequency.

Doubling the clock frequency leads to between 27% and 50% performance improvement. Results are not sensitive to heap size. Doubling the clock speed does not double performance, which would correspond to a 100% improvement. Our benchmarks fall short of perfect scaling, most likely because of inter-thread synchronization and memory intensity. However, as we will see in all of our results, core frequency is one of the most significant factors in determining, and improving, application time. Below, we will show that scaling down JVM thread frequency does not affect performance as drastically as scaling down application thread frequency.

4.2 The Cost of Isolation

We now analyze the performance penalty inherent to isolating JVM service threads to another socket. We investigate this cost for four collector threads, for the compilation thread, for all JVM service threads except collector threads, and then for all JVM threads together.

Figure 2. Percent execution time improvement when boosting frequency from lowest to highest on one socket.

The motivation for this experiment is twofold. First, in order to investigate the feasibility of scaling down the frequency of only JVM service threads, we must first isolate threads onto a separate socket, because we are only able to scale frequency at the socket level. Second, we want to study how isolating JVM service threads to another socket hurts performance (through reduced data locality) or helps it (by getting off the application's critical path).

Isolating garbage collection threads.  Figures 3 through 6 show the cost of isolating some JVM threads to a second socket, as compared with the execution times when running all threads on one socket. The graphs present steady-state performance differences for our three heap sizes running at both the highest and lowest core frequencies. While some benchmarks' performance varies with heap size, we see larger performance differences between high and low core frequencies.

Finding 2. Isolating garbage collection threads to a separate socket leads to a small performance degradation (no more than 17%) for most benchmarks because of increased latency between sockets; however, one benchmark substantially benefits (up to 66%) from increased cache capacity.

Figure 3 shows that all but one application suffer from isolating four collection threads, due to more data communication between sockets. For all but lusearch, the degradation is less than 5% at the lowest frequency, and less than 17% at the highest. Lusearch is an outlier that particularly suffers from both increasing the number of collector threads (as shown in Figure 24) and isolating those threads to another socket, with over 40% degradation in performance.


Figure 3. Percent execution time improvement of isolating four collector threads to a second socket.

Figure 4. Percent execution time improvement of isolating the compiler thread to a second socket.

Figure 3 also shows that when lusearch does not suffer from huge amounts of allocation, lusearch-fix's cost of moving collector threads to another socket drops to be in line with the other benchmarks' trends.

Interestingly, avrora benefits from isolating the collector threads to another socket. Performance improves by 36% and 66% at the lowest and highest frequencies, respectively. However, it should be noted that the confidence intervals for avrora are large, and thus performance varies greatly from run to run. For avrora, running all application and collector threads on one socket makes the threads contend more for the cache, and avrora benefits from the increased cache capacity of two sockets. Analyzing hardware performance counters revealed fewer L3 misses when the collector threads were isolated. We will see later that avrora is particularly sensitive to the application-thread-to-core mapping because its application threads do not have uniform behavior, and they share data (see analysis in Sections 4.4 and 4.5).

Figure 5. Percent execution time improvement of isolating all JVM non-collector threads to a second socket.

Isolating the JIT compilation thread.  Figure 4 shows the effect of isolating just the compilation thread, with four collector and four application threads pinned to the first socket.

Finding 3. Isolating the compilation thread to a separate socket either leads to a performance boost, or is performance-neutral. Only avrora's performance at high frequencies suffers because of increased latency between sockets.

Four benchmarks show very little change in performance when the compiler thread is isolated to a separate socket during steady-state. Lusearch and pjbb2005 see a performance win, especially at the higher frequency, by isolating the compiler away from other application and collector threads. Only avrora sees a performance hit, up to 100% at the highest frequency (although confidence intervals are large). It is possible that when the application is sped up, it is more sensitive to the compiler being separated from other JVM threads, such as those that perform on-stack replacement. When analyzing startup time, indeed avrora is the only benchmark for which performance degrades when isolating the compiler or on-stack-replacement threads (see further analysis in this section regarding Figures 8 and 9).

Isolating all JVM service threads.  Figure 5 shows the cost of isolating all JVM service threads except for the four collection threads (nonGC), which remain on the first socket with the application threads. Figure 6 then shows the impact on performance when all JVM threads are isolated onto another socket.

Finding 4. Isolating all JVM service threads to a separate socket leads to larger performance degradation for a few benchmarks (only one suffers more than 22% degradation, and only at the highest frequency), while others are only slightly negatively affected by offloading computation and memory to another socket.


Figure 6. Percent execution time improvement of isolating all JVM threads to a second socket.

Interestingly, while avrora benefited from collector thread isolation, its performance severely degrades when isolating all nonGC threads. At the higher frequency, performance degradation reaches 107%! Comparing this to isolating all JVM service threads, we see that avrora still suffers, but less, with a maximum of only 88% degradation. Avrora is more sensitive to frequency changes in combination with isolation than other benchmarks, and is more sensitive to nonGC threads being isolated, probably because of memory interaction. The benefit avrora obtained from offloading the collector memory activity to another socket is shown in the difference between the nonGC and JVM isolation graphs.

Besides avrora, other benchmarks show smaller degradations in performance when isolating nonGC, or all JVM service, threads. For nonGC thread isolation, pmd has up to 12% performance loss, but other benchmarks either slightly improve or slightly degrade performance. When isolating all JVM service threads, lusearch can suffer over 40%, but the fixed version of lusearch lowers this cost to less than 20%. In general, isolating all JVM threads seems to have a performance impact that is the sum of isolating collection threads and isolating nonGC threads (Figures 3 and 5), which is overall slightly degraded. Pmd can suffer as much as 22% when isolating JVM threads, which we investigate further by isolating certain JVM service threads. We isolate only the main thread, which calls the application's main method, and then only the organizer thread, which performs on-stack replacement. Results are shown in Figure 7 on the left-most bars, and we see that isolating each of these two JVM threads makes pmd's performance suffer, more so at the higher frequency.

Overall, we see that benchmarks respond differently to the isolation of particular JVM service threads. While avrora benefits from separating collection threads, all other benchmarks suffer up to 17%. However, other benchmarks benefit slightly or are neutral to isolating the compilation thread. All benchmarks but avrora do not suffer much from isolating nonGC threads.

Figure 7. Percent execution time improvement of isolating and scaling the frequency of certain JVM threads for pmd and pjbb2005, at 1.5 times the minimum heap size.

Figure 8. Percent execution time improvement of isolating the compiler thread to a second socket at startup time.

While the cost of isolating JVM service threads is not insignificant, it is not unreasonable if the need for power savings is paramount.

Isolation during startup.  Although application performance matters most during steady-state execution, the compilation thread in particular is most active during JVM startup time. Thus, for completeness, we analyze the cost of isolating the compiler thread also at startup time.

Finding 5. During startup time, isolating only the compilation thread has little impact on performance, but isolating all nonGC threads actually improves performance for all benchmarks but avrora. In general, the performance of isolating JVM threads to another socket is better, with some performance improvements, at startup time than at steady-state time.

Figure 8 presents the cost of isolating the compiler thread at startup time (relative to a baseline run also at startup time), during the first iteration of a benchmark, when the compiler is most active.


Figure 9. Percent execution time improvement of isolating the organizer (on-stack replacement) thread to a second socket at startup time.

Figure 10. Percent execution time improvement of isolating all JVM non-collector threads to a second socket at startup time.

We see that all benchmarks but avrora show a negligible impact on performance when the compiler thread is placed onto another socket. Avrora, which has high variation between different runs, sometimes sees a degradation of performance and sometimes an improvement. Although during steady-state, in Figure 4, lusearch and pjbb2005 see an improvement in running time from separating the compilation thread, at startup there is not a big impact on performance. There is more communication and memory sharing between the JVM and application threads during actual compilation activity than during steady-state, which dampens the benefit of using extra CPU and memory resources during startup time. However, with avrora during startup, we do not see the large hit to performance at high frequencies that we did during steady-state; we therefore conclude that avrora is less sensitive to frequency changes at startup time.

Figure 11. Percent execution time improvement of isolating all JVM threads to a second socket at startup time.

In Figures 10 and 11, we redo the experiments isolating nonGC and JVM threads at startup time. At startup time, isolating all JVM threads except the collector threads leads to performance improvements for all but avrora, including up to 16% for pjbb2005. This contrasts with the neutral or slightly negative (especially for avrora and pmd) results we saw at steady-state time. Apparently during startup, the nonGC threads benefit from the extra resources of another socket, and do not suffer from extra communication between threads via memory. We see similar trends for all JVM threads, which, at startup time, suffer less from isolation than at steady-state time, even leading to performance improvements for xalan and pjbb2005.

Because there is a noticeable difference between the startup performance of isolating just the compiler thread and of isolating all nonGC JVM threads, we also perform experiments isolating only the organizer threads that perform on-stack replacement. Figure 9 shows that the organizer threads play the main part in the performance of the nonGC threads. When organizer threads are isolated during startup time, all benchmarks but avrora see a performance benefit, following very similar trends to Figure 10, with pjbb2005 improving only slightly less. Although organizer threads interact with application memory, they benefit from being isolated to another socket with extra resources. In fact, the organizer threads have a larger impact on performance than the singular compilation thread itself. The difference between avrora's and pjbb2005's performance when isolating only the organizer thread and when isolating all nonGC threads is due to the main thread (see pjbb2005 in Figure 7), and to other JVM service thread interaction.

4.3 Analyzing Frequency Scaling

After exploring the cost of isolating JVM threads, we now analyze the cost, in terms of execution time, of scaling each socket's clock frequency from the highest to the lowest. In Figures 12 to 15, we compare against a baseline of separating some JVM threads to a second socket, with both sockets at the highest frequency.


Figure 12. Percent execution time improvement when lowering frequency from highest to lowest, either the four collector threads, or the application (plus the rest) socket.

The left groupings of bars compare a configuration lowering only the isolated JVM threads' frequency, while the right groupings compare lowering only the application and non-isolated threads' frequency. We present all results at all three heap sizes for all benchmarks, but note that there is little heap-size variation.

Finding 6. On average, lowering the frequency of collector threads does degrade performance (usually less than 20%), but about five times less than lowering the application thread frequency.

Figure 12 compares scaling down the frequency of four collector threads, in the left three bar groupings, versus application threads on the right. Although collector threads can be on the application critical path, because they force the application to pause during collection, collector threads do not run all the time, and thus they are amenable to being scaled down in more power-conscious environments. Maximally, avrora's performance degrades 69% when scaling collector threads (with large confidence intervals), with the next highest benchmark degradation at 20%. After scaling application threads, we see avrora's performance can degrade up to 315%!

Finding 7. Lowering the core frequency for the isolated compiler thread affects performance very little, while application performance suffers greatly.

Figure 13 shows that lowering the core frequency for only the compiler thread does not affect steady-state performance on average. In Figure 16, we show that at startup time performance is likewise unaffected by lowering the compilation thread's frequency. In comparison, benchmarks can degrade by as much as 100% when we scale down the application threads' frequency at steady-state time.

Figure 13. Percent execution time improvement when lowering frequency from highest to lowest, either the compiler thread, or the application (plus the rest) socket.

Figure 14. Percent execution time improvement when lowering frequency from highest to lowest, either the JVM non-collector threads, or the application (plus the rest) socket.

Interestingly, avrora is the only application that does not see a performance degradation when the application and other threads are scaled down together.

Finding 8. If worrying about a power budget, scaling down the frequency of JVM threads, while costing some performance (usually less than 20%), has a much more reasonable effect on overall execution time than scaling application threads, which can double the running time.

Looking at the cost of scaling all but the collector threads in Figure 14, and scaling all JVM service threads in Figure 15, we see that sunflow's application threads suffer the most, more than 90%, from running at a lower clock speed, while JVM threads' performance degrades less than 30% for all benchmarks. It is interesting to note how the cost of scaling down pjbb2005's application threads changes between the configuration where the compilation thread is isolated and the one where the nonGC threads are isolated.


Figure 15. Percent execution time improvement when lowering frequency from highest to lowest, either the JVM threads, or the application socket.

With the compilation thread isolated, the degradation is above 80% (Figure 13); with the nonGC threads isolated, it is less than 20% (Figure 14). We see similar, but less drastic, trends for pmd. Both benchmarks see more of a performance hit when the isolated nonGC threads' frequency is scaled down, on the left in Figure 14. The difference in performance is due to the JVM main service thread, as analyzed with performance counters, which is grouped with the application when the compiler is isolated and with the nonGC threads when they are isolated. Figure 7 shows that the cost of lowering the frequency of the isolated main JVM thread is quite significant for pmd and pjbb2005.

On average, scaling JVM threads degrades performance by around 11%, in comparison with 50% for application threads. As we see in Figure 15, lusearch-fix, pmd, sunflow, and xalan suffer the most from lowering application frequency; these are the same benchmarks that benefited the most from boosting frequency in Figure 2. On the other hand, lusearch, pmd, and pjbb2005 suffer the most from scaling down JVM threads. Avrora, while having large variation in results, does not suffer as much degradation for either the application or the JVM threads when their frequency is scaled down.

Frequency scaling during startup.  Here, we study the effects of frequency scaling during startup time, and juxtapose that with the previous results for steady-state performance.

Finding 9. Scaling the frequency of JVM threads at startup time follows similar trends as scaling at steady-state time. During startup, lowering the frequency of JVM threads degrades performance three times less than scaling down application thread frequency.

Figure 16 shows the change in execution time when we scale down the frequency, from highest to lowest, of the isolated compilation thread versus all other threads on the other socket, at startup time. This graph is very similar to Figure 13.

Figure 16. Percent execution time improvement when lowering frequency from highest to lowest, either the compiler thread, or the application (plus the rest) socket, at startup time.

Figure 17. Percent execution time improvement when lowering frequency from highest to lowest, either the JVM non-collector threads, or the application (plus the rest) socket, at startup time.

Although avrora experiences performance variation, all other benchmarks are unaffected by scaling the compilation thread's frequency, and suffer heavily (around 80% or more) when the other socket's frequency is scaled down.

When scaling the frequency of nonGC and all JVM threads, Figures 17 and 18 show that startup time behaves very similarly to steady-state execution. When placing the application and collector threads at a lower frequency, lusearch and pjbb2005 see a slightly worse degradation at startup time as compared with steady-state time in Figures 14 and 15. These benchmarks might be more sensitive to changes at startup time because of higher levels of dynamic compilation. Avrora alone seems to be largely unaffected by frequency scaling at startup time.


Figure 18. Percent execution time improvement when lowering frequency from highest to lowest, either the JVM threads, or the application socket, at startup time.

Figure 19. Execution time comparison normalized to minimum with different thread mappings and scaling frequencies, at twice the minimum heap size.

Overall, lowering the frequency of all JVM threads at startup time degrades performance on average by 16%, and lowering that of the application threads by 47%. While these degradations differ slightly from the steady-state degradations of 10% and 50%, respectively, we observe the same trends.

Overall best performance.  Figure 19 compares the overall execution times of all benchmarks as we move from lower to higher frequencies (left to right) and explore isolating various JVM threads, and scaling the frequency of either the isolated threads or the application and other threads. These results present runs at two times the minimum heap size. Whereas previous graphs gave a percentage improvement in execution time over a baseline run, here we present running times normalized to the minimum time over this set of experiments (so lower is better) in order to analyze trends.

Finding 10. When power-constrained in a multi-socket environment, it is better to either keep application and JVM service threads on one socket and power down the other socket(s), or to isolate the compilation thread onto the second socket and lower its frequency.

Figure 19's left grouping shows performance when all sockets run at the low frequency. Moving right to the second grouping, we boost the frequency of the application and non-isolated threads. Performance generally improves (closer to one on the graph), but not always for avrora. Almost universally, going from the second to the third grouping, now keeping the first socket at the lowest frequency and boosting only the isolated JVM threads on the second socket, running time increases. Avrora shows similar trends, but unlike other benchmarks, it achieves its lowest running time when collection threads are separated from application threads and the application threads are boosted. We surmise avrora has a lot of inter-application communication, and collection threads interfere with cache and bandwidth resources. Finally, the last grouping shows overall improvements when we boost the frequency of both sockets. For all benchmarks but avrora, the fastest run times come from either the configuration where the compilation thread is isolated and the other socket is boosted (HiApp-LoComp), or the configuration also with the compilation thread isolated, but both sockets boosted (IsolateComp-Hi). For benchmarks other than avrora, lusearch, and pjbb2005, the configuration with all threads running together on one socket at the highest frequency (1Socket-Hi) performs almost optimally as well.

4.4 Pairing Application and Collector Threads

Because our benchmarks see up to 17% performance degradation from isolating collector threads to another socket, here we explore the effect of splitting work between sockets without separating all application threads from collection threads. In this section, we pair an application and a collection thread, and place half of the pairs on each socket.
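A hedged sketch of this paired-and-divided placement is given below (it is not our implementation): application thread i and collector thread i form a pair and are assigned to the same socket, with half of the pairs on each socket and each thread on its own core. The mapping of cores 0-3 to socket 0 and cores 4-7 to socket 1 is again an assumption.

public class PairedPlacement {
    static final int CORES_PER_SOCKET = 4;

    // Pairs 0..(pairs/2 - 1) go to socket 0, the remaining pairs to socket 1.
    static int socketOfPair(int pair, int pairs) {
        return (pair < pairs / 2) ? 0 : 1;
    }

    // Within its socket, each pair occupies two neighboring cores: one for the
    // application thread and one for the collector thread.
    static int appCore(int pair, int pairs) {
        return socketOfPair(pair, pairs) * CORES_PER_SOCKET + (pair % (pairs / 2)) * 2;
    }

    static int gcCore(int pair, int pairs) {
        return appCore(pair, pairs) + 1;
    }

    public static void main(String[] args) {
        int pairs = 4; // four application and four collector threads, as in Figure 20
        for (int p = 0; p < pairs; p++)
            System.out.printf("pair %d -> socket %d, application core %d, collector core %d%n",
                    p, socketOfPair(p, pairs), appCore(p, pairs), gcCore(p, pairs));
    }
}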

Figure 20 presents results for running two application and two collection threads on each socket, with all application and collection threads pinned to cores. The graph compares this paired-and-divided configuration against both running all threads on one socket and isolating collection threads to a second socket. We present results at high and low frequencies for two times the minimum heap size.

Finding 11. Other than the anomalous avrora benchmark, which likely has high levels of inter-thread communication, applications benefit more from pairing collector threads together with application threads while running on multiple sockets, with performance comparable to running on only one socket.

Figure 20. Percent execution time improvement going from all threads on one socket and from isolating four collector threads on a second socket to pairing application and collector threads and placing half on each socket, at twice the minimum heap size.

In comparison with all threads on one socket (left bars), lusearch, lusearch-fix, pmd, sunflow, and xalan have almost the same running time when pairing and dividing application and collection threads. At the highest frequency, sunflow degrades performance by 9% because of more communication through memory, while pjbb2005 improves performance by 15% by using twice the last-level cache compared with one socket. Unfortunately, although avrora benefitted from separating collector threads, dividing application threads costs up to 220% in performance. Upon further investigation, unhalted cycle and L3 miss performance counters on the Nehalem reveal that avrora's application threads have non-uniform behavior. Some run many more cycles and incur more last-level cache misses than others. It is also possible avrora's application threads have significant data sharing, because the application threads suffer many more L3 misses when divided between sockets.
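To make the negative percentages concrete, the sketch below spells out the percent-improvement metric as we read it from these figures (baseline time minus measured time, relative to the baseline); this definition is our reading rather than a quotation, and the timings are invented, chosen so the second case reproduces a roughly -220% value like avrora's:

public class PercentImprovement {
    // Percent improvement in execution time relative to a baseline:
    // positive means faster than the baseline, negative means slower.
    static double improvement(double baselineSeconds, double measuredSeconds) {
        return (baselineSeconds - measuredSeconds) / baselineSeconds * 100.0;
    }

    public static void main(String[] args) {
        System.out.println(improvement(10.0, 9.0));  // 10.0: one tenth faster than the baseline
        System.out.println(improvement(10.0, 32.0)); // -220.0: 3.2x the baseline running time
    }
}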

In comparison with isolating all collection threads from application threads (right bars), benchmarks mostly see positive impacts on performance when pairing and dividing application and collection threads. Again, avrora suffers heavily from dividing up application threads, this time by maximally 460%, albeit with large confidence intervals. Other benchmarks either improve (lusearch by 29% and pjbb2005 by 17%) or maintain performance by preserving some locality between application and collector threads.

We explore increasing the number of application and collection threads using the same paired-and-divided methodology. Using four application and four collector threads as a baseline, Figure 21 presents performance improvements at two times the minimum heap size. We first increase the number of application threads to eight while keeping four collector threads, and then increase both to eight threads.

Figure 21. Percent execution time improvement when pairing application and collector threads and placing half on each socket, varying the number of application and collector threads, at twice the minimum heap size.

Finding 12. When running on two sockets, surprisingly, it is not always recommended to set the number of threads equal to the number of cores. All but one benchmark benefits from setting the number of application threads to the number of cores, but only two of our benchmarks benefit from boosting the number of collection threads above four.

The graph shows that almost all benchmarks improve performance by having as many application threads as cores (eight), but few benefit from increasing to eight collection threads. Specifically, avrora and sunflow benefit from using more application threads, and are less sensitive to the number of collector threads, improving performance by 95% and 34-40%, respectively. Other benchmarks degrade performance when going from four to eight collector threads. Lusearch's performance degrades significantly (up to 295%), but with the allocation bug-fix, lusearch-fix does not experience the large degradation when going to eight collection threads. Pseudojbb2005 alone has significant performance degradation with more application threads, around 100%. This degradation could be due to not enough work to keep application threads busy, and increases in thread-synchronization time (particularly with the main thread). In Section 4.6 we analyze the effect of varying the number of application and collection threads on one socket, but the optimal configuration highly depends on the benchmark.

4.5 The Effect of Pinning

Because all previous experiments were performed while pinning application and collection threads to cores, this section explores the effects of removing some thread-pinning on performance. The experiments presented in this section have all threads placed on one socket and are for two times the minimum heap size. Figure 22 compares pinning only the collector threads, only the application threads, and pinning no threads against a baseline of pinning all application and collector threads.

Figure 22. Percent execution time improvement when pinning only collection threads, only application threads, or with no pinning, as compared with pinning both, on one socket at twice the minimum heap size.

Finding 13. Most benchmarks are neutral to application and collector thread pinning, but a few benefit from thread migration.

Apart from avrora, Figure 22 shows that other benchmarks are mostly insensitive to pinning, with small execution time improvements from not pinning the collector (PinApp). Lusearch-fix and xalan perform slightly worse when not pinning the application (PinGC). Pjbb2005 alone clearly benefits from thread movement, and more so at the higher frequency (38%). It is possible that, like avrora, pjbb2005 has more sharing between application threads, and because collection threads are not assigned work based on a particular application thread, they, too, benefit from moving to take advantage of sharing at higher levels of the cache.

Avrora is again an anomaly, having significant benefit, up to 89%, from not pinning application threads. We have surmised that avrora's application threads could have a lot of data sharing, and have discussed that they have non-uniform behavior. We performed an experiment to test whether avrora's large performance discrepancy between pinning and not pinning threads has to do with using only one of the two memory controllers on the socket. We re-executed the baseline experiment, pinning all application and collection threads on one socket, but this time starting pinning at core one instead of core zero (still subsequently using round-robin scheduling). With this small change, as compared with starting at core zero, avrora's execution time improves by 37% and 46% at the low and high frequencies, respectively. This surprising result reveals that avrora is particularly sensitive to how application threads are mapped to cores because perturbations radically change memory subsystem behavior. Other benchmarks did not suffer from the same artifact. Therefore, avrora is a particularly anomalous benchmark that prefers grouping application threads on the same socket and letting them migrate between cores, and separating only collection threads to another socket.
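A minimal sketch of the round-robin core assignment used in the starting-core experiment above (illustrative only; the assumption that cores 0-3 make up the socket is ours for the example):

import java.util.Arrays;

public class RoundRobinPinning {
    // Assign pinned threads to a socket's cores round-robin,
    // starting from a configurable core index.
    static int[] assignCores(int threads, int[] socketCores, int startIndex) {
        int[] assignment = new int[threads];
        for (int t = 0; t < threads; t++) {
            assignment[t] = socketCores[(startIndex + t) % socketCores.length];
        }
        return assignment;
    }

    public static void main(String[] args) {
        int[] socketCores = {0, 1, 2, 3};
        System.out.println(Arrays.toString(assignCores(8, socketCores, 0))); // start at core zero
        System.out.println(Arrays.toString(assignCores(8, socketCores, 1))); // start at core one
    }
}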

4.6 Changing the Number of Collector and Application Threads

Finally, we explore changing the number of application and collector threads, limiting experiments to one socket. Our baseline experiment is always with four application and four collection threads. We first analyze the effect of decreasing the number of collector threads while keeping the application threads at four. Then, keeping the same number of application and collector threads, we vary the number between three and five. Finally, we compare increasing the number of application threads from four to eight while holding the collector threads at four. All results are presented in Figure 23 at two times the minimum heap size.

Figure 23. Percent execution time improvement when varying the number of application and collector threads, on one socket at twice the minimum heap size.

Finding 14. When running on only one socket, the performance sweet-spot is setting the number of application threads and the number of collection threads equal to the number of cores.

Figure 23 shows that almost all benchmarks suffer slightly or remain neutral from lowering the number of collection threads from four to one, with four application threads. Avrora in particular has degraded performance. Only lusearch benefits, but this is due to its excessive allocation, as lusearch-fix is performance-neutral. Because parallel collector threads steal work from a list of pointers to identify live data and are not accessing only core-local data, lusearch could be a pathological case where the collector threads are doing more coordination than useful work. We investigated the performance of lusearch and lusearch-fix going from one to four collector threads in Figure 24, with times normalized to lusearch with one collector thread (at each respective frequency), and lower numbers representing better times. Although lusearch benefits for up to three collector threads, performance degrades significantly with four collector threads, especially at the higher frequency. The same graph shows that lusearch-fix does not suffer in the same way and obtains better overall performance.

Figure 24. Execution time comparison for lusearch and lusearch-fix varying the number of collector threads, normalized to lusearch with one collector thread, on one socket at twice the minimum heap size.

Figure 23 also compares running with three and with five application and collector threads. Except for pjbb2005, all benchmarks degrade in performance when increasing or decreasing threads from the 4-4 configuration. Pseudojbb2005 surprisingly performs 85% better with only three application and collector threads, probably because these are sufficient to perform the computational work for the input set, and more threads just increase inter-thread communication. The right grouping in Figure 23 increases the number of application threads to eight, while keeping the collector threads at four. Because this experiment is performed on one socket, all benchmarks suffer because many threads are contending for limited resources. In general, setting the number of collector threads equal to the number of application threads, and equal to the number of cores, seems to obtain the best performance for our multi-threaded benchmarks running on one socket.

5. Related Work

We first discuss previous work that tried to understand the performance of managed language applications and their ramifications on power. We then discuss research related to dynamic voltage and frequency scaling.

5.1 Understanding JVM Services' Performance and Power

Hu and John [13] perform a simulation-based study and evaluate how processor core characteristics, such as issue queue size, reorder buffer size, and cache size, affect JIT compiler and garbage collection performance. They conclude that JVM services yield different performance and power characteristics compared to the Java application itself.

Esmaeilzadeh et al. [10] evaluate the performance and power consumption across five generations of microprocessors using benchmarks implemented in both native and managed programming languages. The authors considered end-to-end Java workload performance, like our work, but assumed a single socket, whereas we use multi-socket systems. Further, we explore how isolating and slowing/speeding up JVM service threads affects end-to-end performance.

Cao et al. [6] study how a Java application can potentially benefit from hardware heterogeneity. They tease apart the interpreter, JIT compiler, and garbage collector, concluding that JVM services consume on average 20% of total energy, ranging from 10% to 55% across the set of applications considered in the study. They further study how clock frequency, cache size, hardware parallelism, and gross microarchitecture design options (in-order versus out-of-order processor cores) affect the performance achieved per unit of energy for each of the JVM services. Through this analysis, they advocate for heterogeneous multicore hardware, in which JVM services are run on customized simple cores and Java application threads run on high-performance cores.

There are at least two key differences between this prior work [6] and ours. First, our experimental setup considers multi-socket systems, not individual processors. Second, we focus on end-to-end Java workload performance, whereas Cao et al. consider the Java application and the various JVM services in isolation. These key differences enable us to evaluate how scaling down the frequency for particular JVM services affects overall Java workload performance. This is done by separating out the JVM service of interest to another socket and scaling its frequency. A number of conclusions that we obtain are in line with Cao et al. In particular, we confirm that the Java application itself benefits significantly from increasing clock frequency, and garbage collection benefits much less. We also confirm a slight improvement to application performance if the compiler is isolated, regardless of whether the isolated compiler runs at the highest or lowest frequency. However, we also obtain a number of conclusions that are quite different from Cao et al. Whereas Cao et al. conclude that high clock frequency is energy-efficient for the JIT compiler, we find that it has limited impact on overall end-to-end performance. Also, although reducing clock frequency for the garbage collector may be energy-efficient according to Cao et al., we find that it negatively affects end-to-end benchmark performance.

5.2 DVFS

Dynamic Voltage and Frequency Scaling (DVFS) is a widely used power reduction technique: DVFS lowers supply voltage and clock frequency to reduce both dynamic and static power consumption. DVFS is used in commercial processors across the entire computing range, from the embedded and mobile market up to the server market. Extensive research has been done on how to take advantage of DVFS to reduce overall energy consumption while meeting specific performance targets, or to improve performance while not exceeding a given power budget, either through the operating system [17], managed runtime system [23], compiler [12, 24], or architecture [14, 16, 21].

Intel's TurboBoost technology [15] provides the ability to increase clock frequency for a short duration of time if the processor is operating below its power, current, and temperature specification limits. The frequency at which the processor can operate during a TurboBoost period depends on the number of active cores, i.e., the maximum frequency is higher when fewer cores are active. TurboBoost thus has the potential to improve performance by increasing clock frequency during single-threaded execution phases of a multi-threaded workload, or when few programs are active in a multi-program workload environment.

DVFS is typically applied across the entire chip, i.e., in the case of a multicore processor, all cores are scaled up or down. For example, our experimental setup using the Intel Nehalem machine only allows setting the clock frequency at the socket level. However, recent work has focused on per-core scaling, see for example [8, 16, 18]. Per-core DVFS would be a useful extension to the work presented in this paper, but we leave it for future work.

Barroso and Holzle [1] coined the term energy-proportional computing to refer to the property that a computer system should consume energy proportional to the amount of work that it performs. Ideally, a computer system should not consume energy when idling, and should only consume energy proportional to its utilization level. Our findings suggest that many benchmarks obtain very good performance when all threads run on one socket, and hence also advocate minimizing idle power.

6. Conclusions

This paper is one of the first to explore the many axes of experimental setup of multi-threaded Java applications running on multicore, multi-socket hardware, drawing novel conclusions that will help optimize running time while minimizing power. While varying the number of threads, we have shown that on a single socket, the number of application and collector threads should be equal to the number of cores. On two sockets, most benchmarks benefit from pairing application and collection threads, but achieve best performance with fewer collector threads than application threads. While we found a cost, usually less than 20%, to offloading JVM collection threads to another socket, we found that then lowering the core frequency of that socket provided reasonable performance degradations in comparison with lowering the clock speed of application threads, which is very detrimental to performance. However, isolating the compilation thread is performance-neutral or improves performance; and isolating threads at startup time is less detrimental than during steady-state time, sometimes even improving benchmark performance. Many benchmarks achieve good performance when

keeping all threads on one socket, leaving the second socket idle to also save power. However, overall execution time is lowest for all but one benchmark when the compilation thread is isolated, and for the other benchmark when the collector thread is isolated (both regardless of the isolated thread's frequency). These interesting insights will help balance time and power goals of both managed language workloads and hardware resources.

This paper is a first, but important, step towards suggesting enhancements to both operating systems and hardware to facilitate balancing performance and energy on multi-socket systems. It would be useful for the operating system to work with the virtual machine to identify JVM threads separately from application threads and offer more fine-grained control over their movement and mapping to cores. If pinning threads is discovered to be detrimental to performance based on profiling or hardware performance counters, threads could be dynamically moved. Similarly, if hardware provides support for monitoring and scaling power dynamically, on a per-core basis and for finer time quanta, we could scale individual JVM threads for compilation and on-stack-replacement, for example, just at startup time. We could potentially also avoid the cost of isolating collection threads if hardware could scale one core only during collection, and not during application running time. When the cost of isolation cannot be avoided, cache optimizations for prefetching or otherwise avoiding inter-socket traffic could improve memory system behavior. Our extensive analysis of end-to-end performance of multi-threaded managed applications running on multicore chips offers insights into how to design and optimize machines from application to virtual machine to operating system to memory system, down to hardware.

Acknowledgements

We thank the anonymous reviewers for their valuable feedback. We also thank Wim Heirman for his help and technical expertise on thread pinning and frequency scaling on the Intel Nehalem machine. The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement no. 259295.

References

[1] L. A. Barroso and U. Holzle. The case for energy-proportional systems. IEEE Computer, 40:33–37, Dec. 2007.

[2] S. M. Blackburn and K. S. McKinley. Immix: A mark-region garbage collector with space efficiency, fast collection, and mutator locality. In Programming Language Design and Implementation (PLDI), pages 22–32, Tucson, AZ, June 2008.

[3] S. M. Blackburn, M. Hirzel, R. Garner, and D. Stefanovic. pjbb2005: The pseudojbb benchmark. URL http://users.cecs.anu.edu.au/~steveb/research/research-infrastructure/pjbb2005.


[4] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanovic, T. VanDrunen, D. von Dincklage, and B. Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis. In ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 169–190, Oct. 2006.

[5] S. M. Blackburn, K. S. McKinley, R. Garner, C. Hoffman, A. M. Khan, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanovic, T. VanDrunen, D. von Dincklage, and B. Wiedermann. Wake up and smell the coffee: Evaluation methodology for the 21st century. Communications of the ACM, 51(8):83–89, Aug. 2008.

[6] T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley. The yin and yang of power and performance for asymmetric hardware and managed software. In The 39th International Symposium on Computer Architecture (ISCA), pages 225–236, June 2012.

[7] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits, Oct. 1974.

[8] J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu, M. Braganza, S. Meyers, E. Fang, and R. Kumar. An integrated quad-core Opteron processor. In Proceedings of the International Solid State Circuits Conference (ISSCC), pages 102–103, Feb. 2007.

[9] H. Esmaeilzadeh, E. R. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In 38th International Symposium on Computer Architecture (ISCA), pages 365–376, June 2011.

[10] H. Esmaeilzadeh, T. Cao, Y. Xi, S. M. Blackburn, and K. S. McKinley. Looking back on the language and hardware revolutions: Measured power, performance, and scaling. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 319–332, June 2011.

[11] A. Georges, D. Buytaert, and L. Eeckhout. Statistically rigorous Java performance evaluation. In Proceedings of the Annual ACM SIGPLAN Conference on Object-Oriented Programming, Languages, Applications and Systems (OOPSLA), pages 57–76, Oct. 2007.

[12] C.-H. Hsu and U. Kremer. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In Proceedings of the International Symposium on Programming Language Design and Implementation (PLDI), pages 38–48, June 2003.

[13] S. Hu and L. K. John. Impact of virtual execution environments on processor energy consumption and hardware adaptation. In International Conference on Virtual Execution Environments (VEE), pages 100–110, June 2006.

[14] C. J. Hughes, J. Srinivasan, and S. V. Adve. Saving energy with architectural and frequency adaptations for multimedia applications. In Proceedings of the 34th Annual International Symposium on Microarchitecture (MICRO), pages 250–261, Dec. 2001.

[15] Intel Corporation. Intel turbo boost technology in Intel core microarchitecture (Nehalem) based processors, Nov. 2008.

[16] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 347–358, Dec. 2006.

[17] C. Isci, G. Contreras, and M. Martonosi. Live, runtime phase monitoring and prediction on real systems and application to dynamic power management. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 359–370, Dec. 2006.

[18] W. Kim, M. S. Gupta, G.-Y. Wei, and D. Brooks. System level analysis of fast, per-core DVFS using on-chip switching regulators. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), pages 123–134, Feb. 2008.

[19] G. E. Moore. Readings in computer architecture, chapter Cramming more components onto integrated circuits, pages 56–59. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000.

[20] Y. Seeley. JIRA issue LUCENE-1800: QueryParser should use reusable token streams, 2009. URL https://issues.apache.org/jira/browse/LUCENE-1800.

[21] G. Semeraro, D. H. Albonesi, S. G. Dropsho, G. Magklis, S. Dwarkadas, and M. L. Scott. Dynamic frequency and voltage control for a multiple clock domain microarchitecture. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 356–367, Nov. 2002.

[22] TIOBE Software. TIOBE programming community index, 2011. http://tiobe.com/tpci.html.

[23] Q. Wu, V. J. Reddi, Y. Wu, J. Lee, D. Connors, D. Brooks, M. Martonosi, and D. W. Clark. A dynamic compilation framework for controlling microprocessor energy and performance. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 271–282, Nov. 2005.

[24] F. Xie, M. Martonosi, and S. Malik. Compile-time dynamic voltage scaling settings: Opportunities and limits. In Proceedings of the International Symposium on Programming Language Design and Implementation (PLDI), pages 49–62, June 2003.

[25] X. Yang, S. Blackburn, D. Frampton, J. Sartor, and K. McKinley. Why nothing matters: The impact of zeroing. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA), pages 307–324, Oct. 2011.