TRANSCRIPT
IBM T. J. Watson Research Center
End-to-End Project 2/9/2006 © 2003 IBM Corporation
End-to-End Performance Optimization of Java Server Workloads
Jong-Deok Choi
IBM T. J. Watson Research Center
People
§ Pratap Pattnaik, Manish Gupta
§ Trey Cain, Jong-Deok Choi, Suhyun Kim, Kyung Ryu, Mauricio Serrano, Yefim Shuf, Gilad Arnold, Ian Steiner, Richard Zhuang
§ Joefon Jann, Christoph von Praun, Stephen E Smith, Il Park
§ Toshio Nakatani, Kazuaki Ishizaki, Tamiya Onodera
Outline
§ End-to-End Optimization Project
– Workload and Server Configurations
§ Methodology
§ Performance Characteristics
– Method Profiling, Cache, Branch Prediction, Synchronization
§ Summary
End-to-End Optimization Project
§ To understand and optimize the performance of the whole-stack, end-to-end SW/HW layers of commercial middleware applications (J2EE) on IBM’s current and future high-end servers.
Workload: J2EE Multi-tier Server w/ Application
Source: Programming J2EE APIs with WebSphere Advanced by Osamu Takagiwa et al., ibm.com/redbooks
WAS: WebSphere Application Server
J2EE Whole-Stack End-to-End Optimization
HW
OS
WebSphere
Java Application
Java VM
Server Configuration for SPECjAppServer2004/Trade6
§ IBM pSeries p570: 4x 1.65GHz POWER5 (SMT enabled), 15GB main memory, WAS 6.0, AIX 5.3.0 GOLD
§ IBM pSeries p690: 6x 1.1GHz POWER4, 16GB main memory, DB2 UDB v8.2, AIX 5.2B GOLD
§ IBM pSeries p690 (driver): 6x 1.1GHz POWER4, 16GB main memory, WAS 6.0, AIX 5.2B GOLD
§ Machines connected by a 1Gbps network
Outline
§ End-to-End Optimization Project
– Workload and Server Configurations
§ Methodology
§ Performance Characteristics
§ Summary
End-to-End Optimization Methodology
§ e2eDriver with static/dynamic instrumentation of apps, JVM, and OS
[Figure: the instrumented stack (eCLipz HW, OS, WebSphere, WAS Application, Java VM) feeds Temporal Event Correlation (performance metrics pm1, pm2 over time) and Spatial Code Analysis + Measurements (callgraph A -> B, C); correlation/model building drives identification of bottlenecks and their solutions, and ultimately design changes]
Performance Metrics and Tools
Layer: Metrics Examples -- Method
§ WAS Application: response time of each transaction -- Application Response Management (ARM)
§ WebSphere: # of executing beans, # of activated beans -- Performance Monitor Infrastructure (PMI) in WAS
§ Java VM: # of method calls, # of GCs, # of object allocations, # of syncs -- Java instrumentation
§ OS: # of context switches -- AIX trace facility, vmstat, sar
§ HW: # of inst., # of loads, # of D$ misses -- HW performance counters in POWER4/5
Hardware Performance Monitor (HPM)
§ POWER4 has 8 HPM counters that can be programmed to count HW events
– The HW events are combined into logical groups
– There are 61 groups, and 8 events per group (one event per counter)
§ Counters are accessed via PMAPI: pm_init(), pm_start(), …
§ Group 56: CPI, TLB, L1-D cache
– PM_DTLB_MISS: data TLB misses
– PM_ITLB_MISS: instruction TLB misses
– PM_LD_MISS_L1: L1 D-cache load misses
– PM_ST_MISS_L1: L1 D-cache store misses
– PM_CYC: processor cycles
– PM_INST_CMPL: instructions completed
– PM_ST_REF_L1: L1 D-cache store references
– PM_LD_REF_L1: L1 D-cache load references
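As an illustration of how these raw counts become derived metrics (a sketch with made-up counter values, not measurements from the talk):

```python
# Sketch: deriving CPI and L1 D-cache miss rates from POWER4 HPM Group 56.
# Counter names are from the slide; the values are made-up illustrative numbers.
counts = {
    "PM_CYC": 4_500_000,        # processor cycles
    "PM_INST_CMPL": 1_000_000,  # instructions completed
    "PM_LD_REF_L1": 300_000,    # L1 D-cache load references
    "PM_LD_MISS_L1": 15_000,    # L1 D-cache load misses
    "PM_ST_REF_L1": 150_000,    # L1 D-cache store references
    "PM_ST_MISS_L1": 9_000,     # L1 D-cache store misses
    "PM_DTLB_MISS": 2_000,      # data TLB misses
}

cpi = counts["PM_CYC"] / counts["PM_INST_CMPL"]
load_miss_rate = counts["PM_LD_MISS_L1"] / counts["PM_LD_REF_L1"]
store_miss_rate = counts["PM_ST_MISS_L1"] / counts["PM_ST_REF_L1"]
inst_per_dtlb_miss = counts["PM_INST_CMPL"] / counts["PM_DTLB_MISS"]

print(f"CPI = {cpi:.2f}")                       # 4.50
print(f"L1-D load miss rate = {load_miss_rate:.1%}")
print(f"L1-D store miss rate = {store_miss_rate:.1%}")
print(f"instructions per DTLB miss = {inst_per_dtlb_miss:.0f}")
```

The made-up numbers are chosen to echo two results shown later: a steady-state CPI of ~4.5 and ~500 instructions per DTLB miss.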
HW Performance Counters
1. Trade3/WAS/Sovereign/AIX/POWER4, System+User
2. HPM counters, group 3, ~100 seconds
3. Steady-state CPI = ~4.5
HW Performance Counters
1. SpecJBB/J9/AIX/POWER4, System+User
2. HPM counters, group 3, ~900 seconds
3. Steady-state CPI = ~2.5
Temporal Event Correlation
1. Micro Event ↔ Micro Event (e.g., performance metrics)
– CPI ↔ TLB misses
2. Micro Event ↔ Macro Event
– TLB misses ↔ Page fault at OS
3. Macro Event ↔ Macro Event
– Page fault at OS ↔ Class Loading
§ Temporal event correlation employs various statistical tools such as covariance
[Figure: performance metrics pm1 and pm2 over time, annotated with class-loading events (CLload1, CLload2) and page faults]
Derived Metrics - Correlations
Given two vectors X = {xi} and Y = {yi}:
covar(X,Y) = (1/n) Σ (xi − x̄)(yi − ȳ)
cc(X,Y) = covar(X,Y) / SQRT[covar(X,X) * covar(Y,Y)]
cc ranges from -1 (strongly anti-correlated) to +1 (strongly correlated)
• Observe trends: transient, steady-state, periodic
• Certain correlations are expected; spot the unexpected
• Needs systematic study
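The two formulas above translate directly into code; here is a minimal sketch with made-up metric traces (pm1, pm2, pm3 are illustrative, not measured data):

```python
from math import sqrt

def covar(x, y):
    """Covariance with 1/n normalization, as defined on the slide."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

def cc(x, y):
    """Correlation coefficient in [-1, 1]."""
    return covar(x, y) / sqrt(covar(x, x) * covar(y, y))

# Made-up metric traces: pm2 rises and falls with pm1,
# pm3 moves in the opposite direction.
pm1 = [1.0, 2.0, 3.0, 4.0, 5.0]
pm2 = [2.1, 4.2, 6.1, 8.3, 10.0]
pm3 = [5.0, 4.0, 3.0, 2.0, 1.0]

print(round(cc(pm1, pm2), 3))  # close to +1: strongly correlated
print(round(cc(pm1, pm3), 3))  # -1.0: strongly anti-correlated
```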
Spatial Code Analysis
§ Profile-based hot-code analysis
– Context-insensitive
• Identify m hot (frequently executed) methods
• May fail to provide the contexts in which methods are hot
– Context-sensitive
• Identify n hot dynamic call chains
– e.g., critical call-path information
• May fail to recognize hot methods with uniform and low unit cost
[Figure: a call graph of methods A, B, C with edge counts vs. a call tree with per-node cumulative and base costs, e.g. A (cumul 13, base 0), B (cumul 7, base 4), C (cumul 6, base 3)]
Context-Sensitive Analysis
§ SPECjAppServer2004: 50% of JIT'd code execution is in 224 "hottest" methods
– Method profile is "flat"
– Data profile is also "flat"
§ Profile-based hot-code analysis
– Identify n hot dynamic call chains
• e.g., critical call-path information
§ "Accurate, Efficient, and Adaptive Calling-Context Profiling," PLDI 2006
– X. Zhuang, M. Serrano, T. Cain, and J.-D. Choi
Context-Sensitive Analysis
§ Call sequence:
– '->': method call, '<-': method return, '(A)': 'A' is top-of-stack
– A -> B -> C -> E, <-, (C) -> E, <-, <-, <-, (A) -> D -> C, <-, <-, (A) -> B -> C -> E, <-, (C) -> E, <-, <-, <-, (A) -> D -> C, <-, <-, (A)
[Figure: the same call sequence represented as a call tree (one node per call), a call graph with edge profiling, and a Calling-Context Tree (CCT) with per-edge counts]
§ Call tree is too expensive: one node per method call
§ Call graph is too imprecise: cannot tell whether B or D is more responsible for the frequent calls of E by C
§ CCT is not as expensive as the call tree; on the CCT it's clear B->C->E is the expensive call path
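As a sketch of the data structure (not the paper's adaptive implementation), a CCT can be built by hooking call and return events; method names and the replayed sequence below are illustrative:

```python
class CCTNode:
    """One calling context: a method plus the chain of callers above it."""
    def __init__(self, method, parent=None):
        self.method = method
        self.parent = parent
        self.count = 0       # times this context was entered
        self.children = {}   # callee method name -> CCTNode

class CCT:
    def __init__(self, root_method):
        self.root = CCTNode(root_method)
        self.cur = self.root

    def on_call(self, method):
        child = self.cur.children.setdefault(method, CCTNode(method, self.cur))
        child.count += 1
        self.cur = child

    def on_return(self):
        self.cur = self.cur.parent

    def path_count(self, *methods):
        node = self.root
        for m in methods:
            node = node.children[m]
        return node.count

# Replay a call sequence like the one on the slide (simplified):
cct = CCT("A")
for _ in range(2):                        # A calls B twice...
    cct.on_call("B"); cct.on_call("C")
    cct.on_call("E"); cct.on_return()     # ...and C calls E twice per visit
    cct.on_call("E"); cct.on_return()
    cct.on_return(); cct.on_return()
cct.on_call("D"); cct.on_call("C"); cct.on_return(); cct.on_return()

# Unlike a call graph, the CCT separates C-called-from-B from C-called-from-D:
print(cct.path_count("B", "C", "E"))   # 4: calls of E in context A->B->C
print(cct.path_count("D", "C"))        # 1: calls of C in context A->D
```

Because contexts that repeat share one node, the tree stays far smaller than a full call tree while still distinguishing the hot path.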
Outline
§ End-to-End Optimization Project
– Workload and Server Configurations
§ Methodology
§ Performance Characteristics
§ Summary
CPI Stacks
[Figure: stacked CPI bars (0 to ~3.5) for SPECjAppServer, Trade6, and SPECjbb, decomposed into: instruction supply stalls; LSU reject stalls; LSU translation stalls; LSU flush stalls + basic latency; LSU D-cache miss stalls; FXU + FPU latencies; other stalls (incl. BRU/CRU instruction latencies, non-LSU flush penalty); and instructions complete]
Instruction supply stall breakdown
[Figure: for SPECjAppServer, Trade6, and SPECjbb, the percent of instruction-queue-empty cycles attributed to: I-cache miss; branch mispredict; other (store queue full, other flush)]
L1 Miss Data Load Patterns: JAS2004
[Figure: for addresses 0-4000 MB, the percent of L1-miss data loads satisfied by L2, L3, and memory, with the Java meta-data and Java heap address regions marked]
Types of Java Heap misses – SPECjbb2000
[Figure: breakdown (0-100%) of Java heap misses by object type: spec/jbb/Item, char[], java/lang/String, spec/jbb/Customer, java/lang/Object[], long[], spec/jbb/infra/Collections/longBTreeNode, spec/jbb/Stock, and the remainder]
Types of Java Heap misses – JAS 2004
[Figure: breakdown (0-100%) of Java heap misses by object type: several com/ibm/… classes, org/apache/jasper/runtime/JspWriterImpl, long[], java/lang/Object[], int[], byte[], java/lang/String, char[], and the remainder]
Misses by Component
[Figure: miss percentages (0-100) broken down by code component: JITTED code, jit, jvm23, jvmother, unix, inet, pthreads, other]
Misses by Component
[Figure: per-thread-pool miss breakdown (WebContainer, ORB, Inbound Reader, Default) by the same components: JITTED code, jit, jvm23, jvmother, unix, inet, pthreads, other]
JIT Analysis
[Pie chart of CPU time: WAS (JIT) 29%, WAS (other) 34%, IHS 16%, DB2 15%, Other 6%]
§ Data collected from the last 5 minutes of a 60-minute run
§ 63% of CPU time in WAS
§ JIT'd code in WAS (48% of WAS execution, 29% overall)
– JAS2004 JIT'd code: 3% of all JIT'd code
– Enterprise Java Service <com.ibm.ejs>: 22% of JIT'd
– WebSphere <com.ibm.ws>: 28% of JIT'd
§ Not-JIT'd code in WAS (the other 52% of WAS execution time)
– 15% in kernel
– 12% in libdb2.a
– 11% in libmqmcs_r.a
– 9% in libj9vm22.so
§ 50% of JIT'd code execution is in 224 "hottest" methods
– Method profile is "flat"
– Data profile is also "flat"
L1 Miss Data Load Patterns: JAS2004
[Figure: for addresses 0-4000 MB, the percent of L1-miss data loads satisfied by L2, L3, and memory, with the Java meta-data and Java heap address regions marked]
Data Cache Misses
[Figure: percent of data cache misses to the Java heap, meta data, and remaining regions for SPECjbb, Trade6, and JAS2004]
Loads from L3 Classified by Region
[Figure: percent of L3 loads to the Java heap, meta data, and remaining regions for SPECjbb and JAS2004]
What is Meta-Data?
§ JVM data structures not directly accessible by the user application:
– Object type information, class information, dispatch table, …
– Mostly accessed via indirection
– Heavily used in Java
• invokevirtual, invokeinterface, checkcast, instanceof, …
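To illustrate the indirection, here is a schematic model (not real JVM internals; class and slot names are hypothetical) of why an invokevirtual-style dispatch touches meta-data:

```python
# Schematic model of invokevirtual-style dispatch:
# object -> class meta-data -> dispatch table -> method slot. Each arrow is a
# dependent load, which is why meta-data features so heavily in D$ misses.

class KlassMeta:
    """Per-class meta-data: a name plus a virtual dispatch table (schematic)."""
    def __init__(self, name, dispatch_table):
        self.name = name
        self.dispatch_table = dispatch_table  # slot index -> function

class Obj:
    """An object header holding only a class pointer (schematic)."""
    def __init__(self, klass):
        self.klass = klass

TO_STRING_SLOT = 0  # hypothetical vtable slot for a toString-like method

animal = KlassMeta("Animal", {TO_STRING_SLOT: lambda self: "an animal"})
dog    = KlassMeta("Dog",    {TO_STRING_SLOT: lambda self: "a dog"})

def invokevirtual(obj, slot):
    # Three dependent loads: obj.klass, then .dispatch_table, then the slot.
    return obj.klass.dispatch_table[slot](obj)

print(invokevirtual(Obj(dog), TO_STRING_SLOT))  # a dog
```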
Capacity or Communication?
[Figure: L2 misses per instruction (0 to ~0.004) for SPECjAppServer, Trade6, and SPECjbb, broken down by where the miss was satisfied: L2.75 shared, L2.75 modified, L3 local, L3.75 shared, L3.75 modified, local memory, remote memory]
Capacity or Communication?
[Figure: L2 miss cycles per instruction (approx., 0 to ~0.5) for SPECjAppServer, Trade6, and SPECjbb, with the same breakdown: L2.75 shared/modified, L3 local, L3.75 shared/modified, local memory, remote memory]
Data Cache Performance: Summary
§ Memory ops are common: almost 50% of instructions
§ Stronger load performance, relatively weaker store performance
§ Mostly capacity misses, without many communication misses
§ Object meta-data accounts for a large portion of D$ misses
– invokevirtual, invokeinterface, checkcast, instanceof, …
Branch Misprediction
§ Relatively high as expected
– Correlated with GC events
§ Target address (TA) misses are strongly correlated with L1 I$ miss rate (0.9)
– TA misses could lead to fetching useless instructions, evicting useful data & instructions
§ No apparent L1 D$ pollution
– Low correlation between "speculation" rate and L1 D$ misses
§ Relatively high misprediction rate, insignificant correlation with CPI
[Figure: condition misses / branches and target address misses / branches, as misprediction rate (percent) over a 160-second run]
Comparison (Aggregate View), AIX/POWER4
A: TrWasSovS+U: Trade3, WAS, Sovereign, System+User, 4 CPUs
B: TrWasJ9S+U: Trade3, WAS, J9, System+User, 4 CPUs
C: TrWasJ9U: Trade3, WAS, J9, User, 4 CPUs
D: JbbJ9S+U: SpecJBB, J9, System+User, 4 CPUs
E: TPC-C: Native (C) code, 32 CPUs

Metric             A: TrWasSovS+U  B: TrWasJ9S+U  C: TrWasJ9U  D: JbbJ9S+U  E: TPC-C
1: CPI             3.846           3.780          3.313        1.79         3.682
2: BR/Inst         24.43 %         23.38 %        23.27 %      18.96 %      19.0140 %
3: MPRED_CR/BR     6.48 %          5.23 %         5.28 %       5.36 %       4.3095 %
4: MPRED_TA/BR     4.96 %          4.64 %         5.38 %       1.35 %       1.7619 %
5: MPRED/BR        11.44 %         9.87 %         10.64 %      6.71 %       6.0714 %
6: MPRED/Inst      2.79 %          2.27 %         2.48 %       1.27 %       1.1546 %
7: MPRED_CR/Inst   1.58 %          1.22 %         1.23 %       1.02 %       0.7619 %
8: MPRED_TA/Inst   1.21 %          1.09 %         1.25 %       0.26 %       0.3927 %

1. Small Java on J9 (D) shows very good CPI (1.79)
2. Branch rate: WAS/apps (A - C) > small Java (D), native code (E)
7. Branch misprediction (CR: conditional): WAS/apps (A - C) > small Java (D), native code (E), 2:1
8. Branch misprediction (TA: target addr): WAS/apps (A - C) >> small Java (D), 4:1; native (E), 3:1
Address Translation
§ Tolerable frequency of TLB & ERAT misses
– 2-3 orders of magnitude fewer TLB misses during GC
• Graph fitted using Bezier smoothing; the spikes actually correspond to events that take 0.2-0.3 s, the duration of a GC
– ~500 instructions / DTLB miss
– ~25% of DERAT misses result in a TLB miss → can be expensive
§ Large pages help!
– DTLB miss rate improved by 25%
– ITLB miss rate improved by 15%
§ Much of the working set is maintained in the ERAT and TLB
[Figure: DERAT miss, IERAT update, DTLB miss, and ITLB miss rates per instruction (up to ~1%) over a 160-second run]
Outline
§ End-to-End Optimization Project
– Workload and Server Configurations
§ Methodology
§ Performance Characteristics
– Synchronization
§ Summary
Synchronization Overhead
§ Mem-sync instructions can occur frequently in multi-threaded server code.
§ Cost is relatively high
Cycle times for memory barriers and atomic read/write:
Instruction    POWER4  POWER5
isync          30      10
lwsync         110     25
sync           140     50
lwarx/stwcx    80      75
Locking Frequency
benchmark      freq [ops/ms]
jigsaw         630
hedc           2320
trade6         1530
jbb (16 wh)    45
jbb (4 wh)     30
jgf_ray        << 1
jgf_monte      1650
jgf_mol        << 1
jvm98_mtrt     120
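Combining a lock frequency with the atomic-op cycle costs above gives a rough upper bound on sync overhead. This is my own back-of-envelope, not a figure from the talk, and it assumes every lock op pays one lwarx/stwcx and nothing overlaps:

```python
# Back-of-envelope: what fraction of one 1.1 GHz POWER4 CPU's cycles could
# lock operations consume? Illustrative only; assumes each lock op costs
# one lwarx/stwcx pair (~80 cycles on POWER4) and ignores contention/overlap.
CLOCK_CYCLES_PER_MS = 1.1e9 / 1000   # 1.1 GHz -> cycles per millisecond
LWARX_STWCX_CYCLES = 80              # POWER4 cost from the table above

def sync_overhead_fraction(lock_ops_per_ms):
    return lock_ops_per_ms * LWARX_STWCX_CYCLES / CLOCK_CYCLES_PER_MS

for name, freq in [("trade6", 1530), ("hedc", 2320), ("jbb (16 wh)", 45)]:
    print(f"{name}: ~{sync_overhead_fraction(freq):.1%} of one CPU's cycles")
```

On these assumptions trade6's ~1530 lock ops/ms would cost on the order of 11% of one CPU's cycles, which is why the removal experiment on the next slide yields measurable speedups.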
Overhead in single-threaded benchmarks
§ IBM's commercial VM on 4-way POWER5 1.6 GHz
§ Removed sync operations in the JIT code generator
benchmark       speedup
jvm98_db        1.15
jvm98_jack      1.04
jvm98_javac     1.09
jgf_monte_A     1.02
jgf_monte_B     1.04
jbb (1 wh)      1.04
Thread-Local Locks
Fraction of lock operations on thread-local locks [%]:
benchmark      [%]
jigsaw         45.5
hedc           89.6
trade6         30.3
jbb (16 wh)    18.9
jbb (4 wh)     33.5
jgf_ray        81.3
jgf_monte      99.6
jgf_mol        82.7
jvm98_mtrt     99.3
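One way to exploit this observation is to skip the expensive memory-sync path while a lock has only ever been acquired by a single thread. The sketch below shows the general idea only; it is not the VM's actual mechanism, and real lock reservation must handle the ownership-transfer race that this toy version papers over with a plain lock on the slow path:

```python
import threading

class ThreadLocalAwareLock:
    """Sketch: a lock with a fast path for thread-local use (illustrative)."""
    def __init__(self):
        self._owner = None            # id of the only thread seen so far
        self._shared = False          # True once a second thread appears
        self._real = threading.RLock()
        self.fast_acquires = 0
        self.slow_acquires = 0

    def acquire(self):
        me = threading.get_ident()
        if not self._shared:
            if self._owner is None:
                self._owner = me      # reserve for the first thread
            if self._owner == me:
                self.fast_acquires += 1   # thread-local: no atomic op needed
                return
            self._shared = True           # second thread: inflate the lock
        self.slow_acquires += 1
        self._real.acquire()

    def release(self):
        if self._shared:
            self._real.release()

lock = ThreadLocalAwareLock()
for _ in range(1000):
    lock.acquire(); lock.release()
print(lock.fast_acquires, lock.slow_acquires)  # 1000 0
```

For a benchmark like jgf_monte, where ~99.6% of lock ops are thread-local, nearly every acquire would take the fast path.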
Lock Locality
Dynamic fraction of lock operations that have locality to <...>:
benchmark      thr [%]  proc [%]  thr or proc [%]
jigsaw         91.8     84.8      91.8
hedc           97.2     94.9      98.1
trade6         74.5     64.5      76.7
jbb (16 wh)    99.7     91.7      99.7
jbb (4 wh)     99.7     97.6      99.7
jgf_ray        99.0     98.8      99.2
jgf_monte      99.8     99.0      99.8
jgf_mol        98.4     98.2      98.5
jvm98_mtrt     99.9     99.9      99.9
§ Scheduling dependence implies thread-local
CPI Correlation
§ No single parameter is perfectly correlated with CPI – a balanced system
§ Branch mispredictions are not correlated with CPI
§ L1 D$ events are not strongly correlated with CPI
§ I$ fetches and address translation correlate with CPI
§ Prefetches and stream allocations are correlated with CPI
cc(X,Y) = Σ (xi − x̄)(yi − ȳ) / SQRT[Σ (xi − x̄)² * Σ (yi − ȳ)²]
[Figure: positive and negative correlation with CPI (0-0.9) for branch statistics (branches, target misses, condition misses), translation events (ITLB/DTLB miss, tablewalk cycles, IERAT translation, DERAT miss, ISLB/DSLB miss, SRQ sync), instruction fetch sources (L1I, L2I, L2.5I, L2.75I, L3I, L3.5I, Mem, cycles w/ return), speculation rate, prefetch events (D$ prefetch stream alloc, L2 prefetches, L1D prefetches), and L1 D$ events (load/store misses and references)]
Summary
§ We have presented performance characteristics of Java server workloads
§ Unlike on a desktop system, GC is not a big issue
§ They have a higher branch-target misprediction rate
§ Data cache miss rates are high
– Java meta-data cache misses are high
– Mostly capacity misses, and low communication misses
§ About 60% of CPU time is in Java, and ½ of that is in JIT'd code
§ Method profile is flat
§ Quite a large number of redundant memory sync operations
§ No single performance metric has a very high correlation with CPI
Garbage Collection
§ GC not as significant as past characterization papers have shown
[Figure: total, mark, and sweep GC times (up to ~450 ms) and heap utilization (up to ~30%) over a 1:00 benchmark run]
§ Time between GCs: 25-28 s
§ GC time: 300-400 ms
§ Percent of runtime: 1.30%
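The ~1.3% figure follows directly from the other two numbers; a quick sanity check (my arithmetic, using the endpoints from the slide):

```python
# Sanity-check the GC overhead figure: a 300-400 ms GC every 25-28 s
# should cost on the order of 1-1.6% of runtime.
gc_time_ms = (300, 400)
interval_s = (25, 28)

low = gc_time_ms[0] / (interval_s[1] * 1000)   # best case: short GC, long gap
high = gc_time_ms[1] / (interval_s[0] * 1000)  # worst case: long GC, short gap
print(f"GC overhead: {low:.2%} - {high:.2%}")  # 1.07% - 1.60%, bracketing 1.30%
```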