TRANSCRIPT
IBM T. J. Watson Research Center
End-to-End Project 2/9/2006 © 2003 IBM Corporation
End-to-End Performance Optimization of Java Server Workloads
Jong-Deok Choi
IBM T. J. Watson Research Center
People
§ Pratap Pattnaik, Manish Gupta
§ Trey Cain, Jong-Deok Choi, Suhyun Kim, Kyung Ryu, Mauricio Serrano, Yefim Shuf, Gilad Arnold, Ian Steiner, Richard Zhuang
§ Joefon Jann, Christoph von Praun, Stephen E Smith, Il Park
§ Toshio Nakatani, Kazuaki Ishizaki, Tamiya Onodera
Outline
§ End-to-End Optimization Project
– Workload and Server Configurations
§ Methodology
§ Performance Characteristics
– Method Profiling, Cache, Branch Prediction, Synchronization
§ Summary
End-to-End Optimization Project
§ To understand and optimize the performance of the whole-stack, end-to-end SW/HW layers of commercial middleware applications (J2EE) on IBM’s current and future high-end servers.
Workload: J2EE Multi-tier Server w/ Application
Source: Programming J2EE APIs with WebSphere Advanced by Osamu Takagiwa et al., ibm.com/redbooks
WAS: WebSphere Application Server
J2EE Whole-Stack End-to-End Optimization
HW
OS
WebSphere
Java Application
Java VM
Server Configuration for SPECjAppServer2004/Trade6
§ IBM pSeries p570: 4x 1.65GHz POWER5 (SMT enabled), 15GB main memory, WAS 6.0, AIX 5.3.0 GOLD
§ IBM pSeries p690: 6x 1.1GHz POWER4, 16GB main memory, DB2 UDB v8.2, AIX 5.2B GOLD
§ IBM pSeries p690 (driver): 6x 1.1GHz POWER4, 16GB main memory, WAS 6.0, AIX 5.2B GOLD
§ Machines connected by a 1Gbps network
Outline
§ End-to-End Optimization Project
– Workload and Server Configurations
§ Methodology
§ Performance Characteristics
§ Summary
End-to-End Optimization Methodology
§ e2eDriver with static/dynamic instrumentation of apps, JVM, and OS
[Figure: the instrumented stack (eCLipz HW, OS, WebSphere, WAS Application, Java VM) feeds Temporal Event Correlation (performance metrics pm1, pm2 over time) and Spatial Code Analysis + Measurements (callgraph A -> B, C); correlation/model building drives identification of bottlenecks and their solutions, and ultimately design changes]
Performance Metrics and Tools
Layer: Metrics Examples -- Method
§ WAS Application: response time of each transaction -- Application Response Management (ARM)
§ WebSphere: # of executing beans, # of activated beans -- Performance Monitor Infrastructure (PMI) in WAS
§ Java VM: # of method calls, # of GCs, # of object allocations, # of syncs -- Java instrumentation
§ OS: # of context switches -- AIX trace facility, vmstat, sar
§ HW: # of inst., # of loads, # of D$ misses -- HW performance counters in POWER4/5
Hardware Performance Monitor (HPM)
§ POWER4 has 8 HPM counters that can be programmed to count HW events
– The HW events are combined into logical groups
– There are 61 groups, and 8 events per group (one event per counter)
§ Counters are accessed via PMAPI: pm_init(), pm_start(), …
§ Group 56: CPI, TLB, L1-D cache
– PM_DTLB_MISS: data TLB misses
– PM_ITLB_MISS: instruction TLB misses
– PM_LD_MISS_L1: L1 D-cache load misses
– PM_ST_MISS_L1: L1 D-cache store misses
– PM_CYC: processor cycles
– PM_INST_CMPL: instructions completed
– PM_ST_REF_L1: L1 D-cache store references
– PM_LD_REF_L1: L1 D-cache load references
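As an illustration of how these raw counts become derived metrics (a sketch with made-up counter values, not measurements from the talk):

```python
# Sketch: deriving CPI and L1 D-cache miss rates from POWER4 HPM Group 56.
# Counter names are from the slide; the values are made-up illustrative numbers.
counts = {
    "PM_CYC": 4_500_000,        # processor cycles
    "PM_INST_CMPL": 1_000_000,  # instructions completed
    "PM_LD_REF_L1": 300_000,    # L1 D-cache load references
    "PM_LD_MISS_L1": 15_000,    # L1 D-cache load misses
    "PM_ST_REF_L1": 150_000,    # L1 D-cache store references
    "PM_ST_MISS_L1": 9_000,     # L1 D-cache store misses
    "PM_DTLB_MISS": 2_000,      # data TLB misses
}

cpi = counts["PM_CYC"] / counts["PM_INST_CMPL"]
load_miss_rate = counts["PM_LD_MISS_L1"] / counts["PM_LD_REF_L1"]
store_miss_rate = counts["PM_ST_MISS_L1"] / counts["PM_ST_REF_L1"]
inst_per_dtlb_miss = counts["PM_INST_CMPL"] / counts["PM_DTLB_MISS"]

print(f"CPI = {cpi:.2f}")                       # 4.50
print(f"L1-D load miss rate = {load_miss_rate:.1%}")
print(f"L1-D store miss rate = {store_miss_rate:.1%}")
print(f"instructions per DTLB miss = {inst_per_dtlb_miss:.0f}")
```

The made-up numbers are chosen to echo two results shown later: a steady-state CPI of ~4.5 and ~500 instructions per DTLB miss.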
HW Performance Counters
1. Trade3/WAS/Sovereign/AIX/POWER4, System+User
2. HPM counters, group 3, ~100 seconds
3. Steady-state CPI = ~4.5
HW Performance Counters
1. SpecJBB/J9/AIX/POWER4, System+User
2. HPM counters, group 3, ~900 seconds
3. Steady-state CPI = ~2.5
Temporal Event Correlation
1. Micro Event ↔ Micro Event (e.g., performance metrics)
– CPI ↔ TLB misses
2. Micro Event ↔ Macro Event
– TLB misses ↔ Page fault at OS
3. Macro Event ↔ Macro Event
– Page fault at OS ↔ Class Loading
§ Temporal event correlation employs various statistical tools such as covariance
[Figure: performance metrics pm1 and pm2 over time, annotated with class-loading events (CLload1, CLload2) and page faults]
Derived Metrics - Correlations
Given two vectors X = {xi} and Y = {yi}:
covar(X,Y) = (1/n) Σ (xi − x̄)(yi − ȳ)
cc(X,Y) = covar(X,Y) / SQRT[covar(X,X) * covar(Y,Y)]
cc ranges from -1 (strongly anti-correlated) to +1 (strongly correlated)
• Observe trends: transient, steady-state, periodic
• Certain correlations are expected; spot the unexpected
• Needs systematic study
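The two formulas above translate directly into code; here is a minimal sketch with made-up metric traces (pm1, pm2, pm3 are illustrative, not measured data):

```python
from math import sqrt

def covar(x, y):
    """Covariance with 1/n normalization, as defined on the slide."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

def cc(x, y):
    """Correlation coefficient in [-1, 1]."""
    return covar(x, y) / sqrt(covar(x, x) * covar(y, y))

# Made-up metric traces: pm2 rises and falls with pm1,
# pm3 moves in the opposite direction.
pm1 = [1.0, 2.0, 3.0, 4.0, 5.0]
pm2 = [2.1, 4.2, 6.1, 8.3, 10.0]
pm3 = [5.0, 4.0, 3.0, 2.0, 1.0]

print(round(cc(pm1, pm2), 3))  # close to +1: strongly correlated
print(round(cc(pm1, pm3), 3))  # -1.0: strongly anti-correlated
```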
Spatial Code Analysis
§ Profile-based hot-code analysis
– Context-insensitive
• Identify m hot (frequently executed) methods
• May fail to provide the contexts in which methods are hot
– Context-sensitive
• Identify n hot dynamic call chains
– e.g., critical call-path information
• May fail to recognize hot methods with uniform and low unit cost
[Figure: a call graph of methods A, B, C with edge counts vs. a call tree with per-node cumulative and base costs, e.g. A (cumul 13, base 0), B (cumul 7, base 4), C (cumul 6, base 3)]
Context-Sensitive Analysis
§ SPECjAppServer2004: 50% of JIT'd code execution is in 224 "hottest" methods
– Method profile is "flat"
– Data profile is also "flat"
§ Profile-based hot-code analysis
– Identify n hot dynamic call chains
• e.g., critical call-path information
§ "Accurate, Efficient, and Adaptive Calling-Context Profiling," PLDI 2006
– X. Zhuang, M. Serrano, T. Cain, and J.-D. Choi
Context-Sensitive Analysis
§ Call sequence:
– '->': method call, '<-': method return, '(A)': 'A' is top-of-stack
– A -> B -> C -> E, <-, (C) -> E, <-, <-, <-, (A) -> D -> C, <-, <-, (A) -> B -> C -> E, <-, (C) -> E, <-, <-, <-, (A) -> D -> C, <-, <-, (A)
[Figure: the same call sequence represented as a call tree (one node per call), a call graph with edge profiling, and a Calling-Context Tree (CCT) with per-edge counts]
§ Call tree is too expensive: one node per method call
§ Call graph is too imprecise: cannot tell whether B or D is more responsible for the frequent calls of E by C
§ CCT is not as expensive as the call tree; on the CCT it's clear B->C->E is the expensive call path
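As a sketch of the data structure (not the paper's adaptive implementation), a CCT can be built by hooking call and return events; method names and the replayed sequence below are illustrative:

```python
class CCTNode:
    """One calling context: a method plus the chain of callers above it."""
    def __init__(self, method, parent=None):
        self.method = method
        self.parent = parent
        self.count = 0       # times this context was entered
        self.children = {}   # callee method name -> CCTNode

class CCT:
    def __init__(self, root_method):
        self.root = CCTNode(root_method)
        self.cur = self.root

    def on_call(self, method):
        child = self.cur.children.setdefault(method, CCTNode(method, self.cur))
        child.count += 1
        self.cur = child

    def on_return(self):
        self.cur = self.cur.parent

    def path_count(self, *methods):
        node = self.root
        for m in methods:
            node = node.children[m]
        return node.count

# Replay a call sequence like the one on the slide (simplified):
cct = CCT("A")
for _ in range(2):                        # A calls B twice...
    cct.on_call("B"); cct.on_call("C")
    cct.on_call("E"); cct.on_return()     # ...and C calls E twice per visit
    cct.on_call("E"); cct.on_return()
    cct.on_return(); cct.on_return()
cct.on_call("D"); cct.on_call("C"); cct.on_return(); cct.on_return()

# Unlike a call graph, the CCT separates C-called-from-B from C-called-from-D:
print(cct.path_count("B", "C", "E"))   # 4: calls of E in context A->B->C
print(cct.path_count("D", "C"))        # 1: calls of C in context A->D
```

Because contexts that repeat share one node, the tree stays far smaller than a full call tree while still distinguishing the hot path.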
Outline
§ End-to-End Optimization Project
– Workload and Server Configurations
§ Methodology
§ Performance Characteristics
§ Summary
CPI Stacks
[Figure: stacked CPI bars (0 to ~3.5) for SPECjAppServer, Trade6, and SPECjbb, decomposed into: instruction supply stalls; LSU reject stalls; LSU translation stalls; LSU flush stalls + basic latency; LSU D-cache miss stalls; FXU + FPU latencies; other stalls (incl. BRU/CRU instruction latencies, non-LSU flush penalty); and instructions complete]
Instruction supply stall breakdown
[Figure: for SPECjAppServer, Trade6, and SPECjbb, the percent of instruction-queue-empty cycles attributed to: I-cache miss; branch mispredict; other (store queue full, other flush)]
L1 Miss Data Load Patterns: JAS2004
[Figure: for addresses 0-4000 MB, the percent of L1-miss data loads satisfied by L2, L3, and memory, with the Java meta-data and Java heap address regions marked]
Types of Java Heap misses – SPECjbb2000
[Figure: breakdown (0-100%) of Java heap misses by object type: spec/jbb/Item, char[], java/lang/String, spec/jbb/Customer, java/lang/Object[], long[], spec/jbb/infra/Collections/longBTreeNode, spec/jbb/Stock, and the remainder]
Types of Java Heap misses – JAS 2004
[Figure: breakdown (0-100%) of Java heap misses by object type: several com/ibm/… classes, org/apache/jasper/runtime/JspWriterImpl, long[], java/lang/Object[], int[], byte[], java/lang/String, char[], and the remainder]
Misses by Component
[Figure: miss percentages (0-100) broken down by code component: JITTED code, jit, jvm23, jvmother, unix, inet, pthreads, other]
Misses by Component
[Figure: per-thread-pool miss breakdown (WebContainer, ORB, Inbound Reader, Default) by the same components: JITTED code, jit, jvm23, jvmother, unix, inet, pthreads, other]
JIT Analysis
[Pie chart of CPU time: WAS (JIT) 29%, WAS (other) 34%, IHS 16%, DB2 15%, Other 6%]
§ Data collected from the last 5 minutes of a 60-minute run
§ 63% of CPU time in WAS
§ JIT'd code in WAS (48% of WAS execution, 29% overall)
– JAS2004 JIT'd code: 3% of all JIT'd code
– Enterprise Java Service <com.ibm.ejs>: 22% of JIT'd
– WebSphere <com.ibm.ws>: 28% of JIT'd
§ Not-JIT'd code in WAS (the other 52% of WAS execution time)
– 15% in kernel
– 12% in libdb2.a
– 11% in libmqmcs_r.a
– 9% in libj9vm22.so
§ 50% of JIT'd code execution is in 224 "hottest" methods
– Method profile is "flat"
– Data profile is also "flat"
L1 Miss Data Load Patterns: JAS2004
[Figure: for addresses 0-4000 MB, the percent of L1-miss data loads satisfied by L2, L3, and memory, with the Java meta-data and Java heap address regions marked]
Data Cache Misses
[Figure: percent of data cache misses to the Java heap, meta data, and remaining regions for SPECjbb, Trade6, and JAS2004]
Loads from L3 Classified by Region
[Figure: percent of L3 loads to the Java heap, meta data, and remaining regions for SPECjbb and JAS2004]
What is Meta-Data?
§ JVM data structures not directly accessible by the user application:
– Object type information, class information, dispatch table, …
– Mostly accessed via indirection
– Heavily used in Java
• invokevirtual, invokeinterface, checkcast, instanceof, …
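To illustrate the indirection, here is a schematic model (not real JVM internals; class and slot names are hypothetical) of why an invokevirtual-style dispatch touches meta-data:

```python
# Schematic model of invokevirtual-style dispatch:
# object -> class meta-data -> dispatch table -> method slot. Each arrow is a
# dependent load, which is why meta-data features so heavily in D$ misses.

class KlassMeta:
    """Per-class meta-data: a name plus a virtual dispatch table (schematic)."""
    def __init__(self, name, dispatch_table):
        self.name = name
        self.dispatch_table = dispatch_table  # slot index -> function

class Obj:
    """An object header holding only a class pointer (schematic)."""
    def __init__(self, klass):
        self.klass = klass

TO_STRING_SLOT = 0  # hypothetical vtable slot for a toString-like method

animal = KlassMeta("Animal", {TO_STRING_SLOT: lambda self: "an animal"})
dog    = KlassMeta("Dog",    {TO_STRING_SLOT: lambda self: "a dog"})

def invokevirtual(obj, slot):
    # Three dependent loads: obj.klass, then .dispatch_table, then the slot.
    return obj.klass.dispatch_table[slot](obj)

print(invokevirtual(Obj(dog), TO_STRING_SLOT))  # a dog
```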
Capacity or Communication?
[Figure: L2 misses per instruction (0 to ~0.004) for SPECjAppServer, Trade6, and SPECjbb, broken down by where the miss was satisfied: L2.75 shared, L2.75 modified, L3 local, L3.75 shared, L3.75 modified, local memory, remote memory]
Capacity or Communication?
[Figure: L2 miss cycles per instruction (approx., 0 to ~0.5) for SPECjAppServer, Trade6, and SPECjbb, with the same breakdown: L2.75 shared/modified, L3 local, L3.75 shared/modified, local memory, remote memory]
Data Cache Performance: Summary
§ Memory ops are common: almost 50% of instructions
§ Stronger load performance, relatively weaker store performance
§ Mostly capacity misses, without many communication misses
§ Object meta-data accounts for a large portion of D$ misses
– invokevirtual, invokeinterface, checkcast, instanceof, …
Branch Misprediction
§ Relatively high as expected
– Correlated with GC events
§ Target address (TA) misses are strongly correlated with L1 I$ miss rate (0.9)
– TA misses could lead to fetching useless instructions, evicting useful data & instructions
§ No apparent L1 D$ pollution
– Low correlation between "speculation" rate and L1 D$ misses
§ Relatively high misprediction rate, insignificant correlation with CPI
[Figure: condition misses / branches and target address misses / branches, as misprediction rate (percent) over a 160-second run]
Comparison (Aggregate View), AIX/POWER4
A: TrWasSovS+U: Trade3, WAS, Sovereign, System+User, 4 CPUs
B: TrWasJ9S+U: Trade3, WAS, J9, System+User, 4 CPUs
C: TrWasJ9U: Trade3, WAS, J9, User, 4 CPUs
D: JbbJ9S+U: SpecJBB, J9, System+User, 4 CPUs
E: TPC-C: Native (C) code, 32 CPUs

Metric             A: TrWasSovS+U  B: TrWasJ9S+U  C: TrWasJ9U  D: JbbJ9S+U  E: TPC-C
1: CPI             3.846           3.780          3.313        1.79         3.682
2: BR/Inst         24.43 %         23.38 %        23.27 %      18.96 %      19.0140 %
3: MPRED_CR/BR     6.48 %          5.23 %         5.28 %       5.36 %       4.3095 %
4: MPRED_TA/BR     4.96 %          4.64 %         5.38 %       1.35 %       1.7619 %
5: MPRED/BR        11.44 %         9.87 %         10.64 %      6.71 %       6.0714 %
6: MPRED/Inst      2.79 %          2.27 %         2.48 %       1.27 %       1.1546 %
7: MPRED_CR/Inst   1.58 %          1.22 %         1.23 %       1.02 %       0.7619 %
8: MPRED_TA/Inst   1.21 %          1.09 %         1.25 %       0.26 %       0.3927 %

1. Small Java on J9 (D) shows very good CPI (1.79)
2. Branch rate: WAS/apps (A - C) > small Java (D), native code (E)
7. Branch misprediction (CR: conditional): WAS/apps (A - C) > small Java (D), native code (E), 2:1
8. Branch misprediction (TA: target addr): WAS/apps (A - C) >> small Java (D), 4:1; native (E), 3:1
Address Translation
§ Tolerable frequency of TLB & ERAT misses
– 2-3 orders of magnitude fewer TLB misses during GC
• Graph fitted using Bezier smoothing; the spikes actually correspond to events that take 0.2-0.3 s, the duration of a GC
– ~500 instructions / DTLB miss
– ~25% of DERAT misses result in a TLB miss → can be expensive
§ Large pages help!
– DTLB miss rate improved by 25%
– ITLB miss rate improved by 15%
§ Much of the working set is maintained in the ERAT and TLB
[Figure: DERAT miss, IERAT update, DTLB miss, and ITLB miss rates per instruction (up to ~1%) over a 160-second run]
Outline
§ End-to-End Optimization Project
– Workload and Server Configurations
§ Methodology
§ Performance Characteristics
– Synchronization
§ Summary
Synchronization Overhead
§ Mem-sync instructions can occur frequently in multi-threaded server code.
§ Cost is relatively high
Cycle times for memory barriers and atomic read/write:
Instruction    POWER4  POWER5
isync          30      10
lwsync         110     25
sync           140     50
lwarx/stwcx    80      75
Locking Frequency
benchmark      freq [ops/ms]
jigsaw         630
hedc           2320
trade6         1530
jbb (16 wh)    45
jbb (4 wh)     30
jgf_ray        << 1
jgf_monte      1650
jgf_mol        << 1
jvm98_mtrt     120
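Combining a lock frequency with the atomic-op cycle costs above gives a rough upper bound on sync overhead. This is my own back-of-envelope, not a figure from the talk, and it assumes every lock op pays one lwarx/stwcx and nothing overlaps:

```python
# Back-of-envelope: what fraction of one 1.1 GHz POWER4 CPU's cycles could
# lock operations consume? Illustrative only; assumes each lock op costs
# one lwarx/stwcx pair (~80 cycles on POWER4) and ignores contention/overlap.
CLOCK_CYCLES_PER_MS = 1.1e9 / 1000   # 1.1 GHz -> cycles per millisecond
LWARX_STWCX_CYCLES = 80              # POWER4 cost from the table above

def sync_overhead_fraction(lock_ops_per_ms):
    return lock_ops_per_ms * LWARX_STWCX_CYCLES / CLOCK_CYCLES_PER_MS

for name, freq in [("trade6", 1530), ("hedc", 2320), ("jbb (16 wh)", 45)]:
    print(f"{name}: ~{sync_overhead_fraction(freq):.1%} of one CPU's cycles")
```

On these assumptions trade6's ~1530 lock ops/ms would cost on the order of 11% of one CPU's cycles, which is why the removal experiment on the next slide yields measurable speedups.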
Overhead in single-threaded benchmarks
§ IBM's commercial VM on 4-way POWER5 1.6 GHz
§ Removed sync operations in the JIT code generator
benchmark       speedup
jvm98_db        1.15
jvm98_jack      1.04
jvm98_javac     1.09
jgf_monte_A     1.02
jgf_monte_B     1.04
jbb (1 wh)      1.04
Thread-Local Locks
Fraction of lock operations on thread-local locks [%]:
benchmark      [%]
jigsaw         45.5
hedc           89.6
trade6         30.3
jbb (16 wh)    18.9
jbb (4 wh)     33.5
jgf_ray        81.3
jgf_monte      99.6
jgf_mol        82.7
jvm98_mtrt     99.3
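One way to exploit this observation is to skip the expensive memory-sync path while a lock has only ever been acquired by a single thread. The sketch below shows the general idea only; it is not the VM's actual mechanism, and real lock reservation must handle the ownership-transfer race that this toy version papers over with a plain lock on the slow path:

```python
import threading

class ThreadLocalAwareLock:
    """Sketch: a lock with a fast path for thread-local use (illustrative)."""
    def __init__(self):
        self._owner = None            # id of the only thread seen so far
        self._shared = False          # True once a second thread appears
        self._real = threading.RLock()
        self.fast_acquires = 0
        self.slow_acquires = 0

    def acquire(self):
        me = threading.get_ident()
        if not self._shared:
            if self._owner is None:
                self._owner = me      # reserve for the first thread
            if self._owner == me:
                self.fast_acquires += 1   # thread-local: no atomic op needed
                return
            self._shared = True           # second thread: inflate the lock
        self.slow_acquires += 1
        self._real.acquire()

    def release(self):
        if self._shared:
            self._real.release()

lock = ThreadLocalAwareLock()
for _ in range(1000):
    lock.acquire(); lock.release()
print(lock.fast_acquires, lock.slow_acquires)  # 1000 0
```

For a benchmark like jgf_monte, where ~99.6% of lock ops are thread-local, nearly every acquire would take the fast path.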
Lock Locality
Dynamic fraction of lock operations that have locality to <...>:
benchmark      thr [%]  proc [%]  thr or proc [%]
jigsaw         91.8     84.8      91.8
hedc           97.2     94.9      98.1
trade6         74.5     64.5      76.7
jbb (16 wh)    99.7     91.7      99.7
jbb (4 wh)     99.7     97.6      99.7
jgf_ray        99.0     98.8      99.2
jgf_monte      99.8     99.0      99.8
jgf_mol        98.4     98.2      98.5
jvm98_mtrt     99.9     99.9      99.9
§ Scheduling dependence implies thread-local
CPI Correlation
§ No single parameter is perfectly correlated with CPI – a balanced system
§ Branch mispredictions are not correlated with CPI
§ L1 D$ events are not strongly correlated with CPI
§ I$ fetches and address translation correlate with CPI
§ Prefetches and stream allocations are correlated with CPI
cc(X,Y) = Σ (xi − x̄)(yi − ȳ) / SQRT[Σ (xi − x̄)² * Σ (yi − ȳ)²]
[Figure: positive and negative correlation with CPI (0-0.9) for branch statistics (branches, target misses, condition misses), translation events (ITLB/DTLB miss, tablewalk cycles, IERAT translation, DERAT miss, ISLB/DSLB miss, SRQ sync), instruction fetch sources (L1I, L2I, L2.5I, L2.75I, L3I, L3.5I, Mem, cycles w/ return), speculation rate, prefetch events (D$ prefetch stream alloc, L2 prefetches, L1D prefetches), and L1 D$ events (load/store misses and references)]
Summary
§ We have presented performance characteristics of Java server workloads
§ Unlike on a desktop system, GC is not a big issue
§ They have a higher branch-target misprediction rate
§ Data cache miss rates are high
– Java meta-data cache misses are high
– Mostly capacity misses, and low communication misses
§ About 60% of CPU time is in Java, and ½ of that is in JIT'd code
§ Method profile is flat
§ Quite a large number of redundant memory sync operations
§ No single performance metric has a very high correlation with CPI
Garbage Collection
§ GC not as significant as past characterization papers have shown
[Figure: total, mark, and sweep GC times (up to ~450 ms) and heap utilization (up to ~30%) over a 1:00 benchmark run]
§ Time between GCs: 25-28 s
§ GC time: 300-400 ms
§ Percent of runtime: 1.30%
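The ~1.3% figure follows directly from the other two numbers; a quick sanity check (my arithmetic, using the endpoints from the slide):

```python
# Sanity-check the GC overhead figure: a 300-400 ms GC every 25-28 s
# should cost on the order of 1-1.6% of runtime.
gc_time_ms = (300, 400)
interval_s = (25, 28)

low = gc_time_ms[0] / (interval_s[1] * 1000)   # best case: short GC, long gap
high = gc_time_ms[1] / (interval_s[0] * 1000)  # worst case: long GC, short gap
print(f"GC overhead: {low:.2%} - {high:.2%}")  # 1.07% - 1.60%, bracketing 1.30%
```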