TRANSCRIPT
STEPS Towards Cache-Resident Transaction Processing
Stavros Harizopoulos
joint work with Anastassia Ailamaki
Carnegie Mellon, VLDB 2004
Carnegie Mellon Databases
OLTP workloads on modern CPUs
• L1-I stalls account for 25-40% of execution time
• Instruction caches cannot grow
We need a solution for instruction cache-residency
[Figures: execution-time breakdown for server CPUs (computation vs. L1-I, L2-I, L2-D, and other stalls) with CPI of 2-6 over CPUs introduced 1996-2004; cache sizes over the same period (10KB-10MB, log scale), with max on-chip L2/L3 caches growing (L2: 256KB, 512KB, 1MB) while L1-I caches stay small]
Steps for cache-resident code
• Eliminate misses for a group of Xactions
  – Xactions are assigned to threads
  – Multiplex execution at fine granularity
  – Reuse instructions in L1-I cache
STEPS: Synchronized Transactions through Explicit Processor Scheduling
Fewer misses & misspred. branches
• Up to 1.4x speedup
• Eliminates 96% of L1-I misses for each add'l thread
• Eliminates 64% of mispredicted branches
[Figures: normalized counts (20-100%) of cycles, L1-I misses, branch mispredictions, and L1-D misses, Shore vs. Steps; number of L1-I misses (2K-8K) vs. concurrent threads (1-8). Workload: index selection in the Payment Xaction (TPC-C)]
Outline
• Background & related work
• Basic implementation of Steps
• Microbenchmarks
  – AthlonXP, SimFlex simulator
• Applying Steps to OLTP workloads
• TPC-C results
  – Shore on AthlonXP
Background
[Diagram: CPU with split L1-I and L1-D caches backed by a unified L2 cache; a loop (code blocks F1-F4) calling function B (block B1), with the blocks competing for the same sets of a 2-way set-associative L1-I cache]
– Caches trade size for lookup speed
– L1-I misses are expensive
– Both capacity misses and conflict misses occur (example: 2-way set-associative L1-I)
Background (cont.)
• A larger cache size or higher associativity would help, but each means slower access to the L1-I cache and a slower CPU clock
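To make the conflict-miss point concrete, here is a toy simulation (not from the talk; all cache parameters are made up for illustration) of a 2-way set-associative instruction cache. Three small code regions fit comfortably in the cache's total capacity, yet because they map to the same sets they keep evicting one another:

```python
from collections import deque

BLOCK = 64          # bytes per cache block
SETS = 64           # number of sets
WAYS = 2            # associativity -> 64 sets * 2 ways * 64B = 8KB cache

class ICache:
    """Tiny 2-way set-associative instruction cache with LRU per set."""
    def __init__(self):
        self.sets = [deque(maxlen=WAYS) for _ in range(SETS)]
        self.misses = 0

    def fetch(self, addr):
        blk = addr // BLOCK
        s, tag = blk % SETS, blk // SETS
        way = self.sets[s]
        if tag in way:
            way.remove(tag)        # hit: refresh LRU position
        else:
            self.misses += 1       # miss: fetch block (evicts LRU if full)
        way.append(tag)

cache = ICache()
# Three 2KB code regions (32 blocks each, 6KB total: less than the 8KB
# capacity) laid out 4KB apart, so they all map to sets 0-31.  Each of
# those sets sees three distinct tags but has only two ways, so looping
# over the regions produces a conflict miss on every single fetch.
regions = [r * SETS * BLOCK for r in range(3)]
for _ in range(10):
    for base in regions:
        for off in range(0, 32 * BLOCK, BLOCK):
            cache.fetch(base + off)
print(cache.misses)   # 3 regions x 32 blocks x 10 passes, all misses
```

Every one of the 960 fetches misses even though the working set is smaller than the cache; a fully associative cache of the same size would miss only on the first pass. This is exactly the trade-off the slide describes: higher associativity removes these misses, but at the cost of slower cache access.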
Related work
• Database & architecture papers:
  – DB workloads are increasingly non-I/O-bound
  – L2/L3 data misses, L1-I misses
  – ORACLE OLTP code working set: 560KB
• Hardware & compiler approaches:
  – Increase block size, add stream buffer [asplos98]
  – Call graph prefetching (for DSS) [tocs03]
  – Code layout optimizations [isca01] [..]
Related work: within the DBMS
• Data-cache misses (mostly DSS)
  – Cache-aware page layout, B-trees, join algorithms
  – Active area [..]
• Instruction-cache misses in DSS
  – Batch processing of tuples [icde01] [sigmod04]
• Instruction-cache misses in OLTP: challenging!
Outline
• Related work
• Basic implementation of Steps
• Microbenchmarks
• Applying Steps to OLTP workloads
• TPC-C results
Steps overview
• The DBMS assigns Xactions to threads
• Xactions consist of a few basic operators (Ops)
  – Index select, scan, update, insert, delete, commit
• Steps groups threads per Op
• Within each Op, reuse instructions
⇒ I-cache aware context-switching
I-cache aware context-switching
BEFORE: each thread runs select() (steps s1-s7) to completion; the CPU then performs a context switch (CTX) to the next thread. The code does not fit in the I-cache, so every thread misses on every step.
AFTER: CTX points are placed where the code just fills the I-cache. Thread 1 misses while loading s1-s3, then the remaining threads hit on the instructions already resident; the group then moves on to s4-s7 the same way.
[Diagram: instruction-cache contents over time, with the per-step miss (M) / hit (H) pattern for each thread before and after]
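The before/after schedules can be sketched with a toy simulation (illustrative only; block counts and cache size are made-up parameters, and the I-cache is modeled as fully associative LRU for simplicity). Running each thread to completion streams the whole Op through the cache once per thread, while switching every cache-sized chunk lets the whole group hit on resident code:

```python
from collections import OrderedDict

CACHE_BLOCKS = 4    # toy I-cache capacity, in code blocks
OP_BLOCKS = 10      # select() spans 10 blocks: it does not fit
N_THREADS = 8

def run(schedule):
    """Replay (thread, code-block) fetches against a fully associative
    LRU I-cache model and count the misses."""
    cache, misses = OrderedDict(), 0
    for _, blk in schedule:
        if blk in cache:
            cache.move_to_end(blk)          # hit: refresh LRU position
        else:
            misses += 1
            cache[blk] = True
            if len(cache) > CACHE_BLOCKS:
                cache.popitem(last=False)   # evict least recently used
    return misses

# BEFORE: each thread runs select() to completion before switching, so
# every thread streams all 10 blocks through the 4-block cache.
before = [(t, b) for t in range(N_THREADS) for b in range(OP_BLOCKS)]

# AFTER (Steps): context-switch every CACHE_BLOCKS blocks, so the whole
# team reuses the chunk of code currently resident in the cache.
after = [(t, b)
         for start in range(0, OP_BLOCKS, CACHE_BLOCKS)
         for t in range(N_THREADS)
         for b in range(start, min(start + CACHE_BLOCKS, OP_BLOCKS))]

print(run(before), run(after))   # 80 misses vs. 10 misses
```

In the sketch the naive schedule misses on all 80 fetches, while the Steps-style schedule pays only the 10 misses needed to load select() once, independent of the number of threads: each additional thread adds zero misses, matching the slide's "eliminate misses for each add'l thread" claim in spirit.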
Basic implementation on Shore
• Assume (for now):
  – Threads interested in same Op
  – Uninterrupted flow (no locks, I/O)
• Fast, small, compatible CTX code
  – 76 bytes; bypasses (for now) the full CTX
• Add CTX calls throughout the Op source code
  – Use hardware counters (PAPI) on a sample Op
Outline
• Related work
• Basic implementation of Steps
• Microbenchmarks
• Applying Steps to OLTP workloads
• TPC-C results
Microbenchmark setup
• All experiments on index fetch, in-memory index
  – 45KB footprint
• Fast CTX for both Steps/Shore, warm cache

                        AMD AthlonXP    Simulated (SimFlex, IA-32)
  L1 I + D cache size   64KB + 64KB     vary all cache parameters
  associativity         2-way
  block size            64 bytes
  L2 cache size         256KB
L1-I cache misses (AthlonXP)
• Steps eliminates 92-96% of misses for add'l threads
• All misses are conflict misses (cache is 64KB)
[Figure: L1-I cache misses (1K-4K) vs. concurrent threads (1-10), Shore vs. Steps]
L1-I misses & speedup (AthlonXP)
• Steps achieves max performance for 6-10 threads
• No need for larger thread groups
[Figure: L1-I miss reduction % (40-100%, with upper limit) and speedup (1.1x-1.4x) vs. concurrent threads (10-80)]
Smaller L1-I cache (AthlonXP and PIII, 10 threads)
• Steps outperforms Shore even on smaller caches (PIII)
• 62-64% fewer mispredicted branches on both CPUs
[Figure: normalized counts (20-120%) of cycles, L1-I misses, branch mispredictions, L1-D misses, branches, branches missing the BTB, and instruction stalls (cycles), for AthlonXP and Pentium III; one PIII bar reaches 209%]
SimFlex: L1-I misses (64-byte cache block, 10 threads)
• Steps eliminates all capacity misses (16KB, 32KB caches)
• Up to 89% overall miss reduction (upper limit is 90%)
[Figure: L1-I cache misses (2K-10K) vs. associativity (direct, 2-way, 4-way, 8-way, full) for Shore and Steps at 16KB, 32KB, and 64KB cache sizes, each with a MIN line; AthlonXP configuration marked]
Outline
• Related work
• Basic implementation of Steps
• Microbenchmarks
• Applying Steps to OLTP workloads
• TPC-C results
Design goals
• High concurrency on similar Ops
  – Cover full spectrum of Ops
• Correctness & low overhead for:
  – Locks, latches, mutexes
  – Disk I/O
  – Exceptions (abort & roll back)
  – Housekeeping (deadlock detection, buffer pool)
Overview
1. Thin wrappers per Op to sync Xactions
   – Form execution teams per Op
   – Flexible definition of Op
2. Best effort within execution teams
   – Fast CTX through fixed scheduling
   – Threads leave team on exceptions
3. Repair thread structures at exceptions
   – Modify only the thread package
System design
• Threads go astray on exceptions
• Regroup at next Op
• Can have execution teams per database table
[Diagram: Steps wrappers around Ops X, Y, and Z, each with an execution team of threads; a stray thread leaves one team and moves to another Op]
Outline
• Related work
• Basic implementation of Steps
• Microbenchmarks
• Applying Steps to OLTP workloads
• TPC-C results
Experimentation setup
• Shore/Steps: AthlonXP, 2GB RAM, 2 disks
• Shore locking
  – Hierarchy: record, page, table, DB
  – Protocol: two-phase
• TPC-C: wholesale parts supplier
  – 10-30 warehouses, 100-300 users
• Increased concurrency through:
  – Zero-think-time TPC-C workload
  – In-memory database, lazy commits
One Xaction: Payment
Steps outperforms Shore:
• 1.4x speedup, 65% fewer L1-I misses
• 48% fewer mispredicted branches
• For 10 warehouses: 15 ready threads, 7 threads per team
[Figure: normalized counts (20-100%) of cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, and branch mispredictions for 10, 20, and 30 warehouses]
Mix of four Xactions
• Xaction mix reduces average team size (4.3 at 10 warehouses)
• Still, Steps has 56% fewer L1-I misses (out of a 77% max)
[Figure: normalized counts (20-100%) of cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, and branch mispredictions for 10 and 20 warehouses; two bars exceed 100% (121%, 125%)]
Summary of results
• Steps can handle full OLTP workloads
• Significant improvements in TPC-C
  – 65% fewer L1-I misses
  – 48% fewer mispredicted branches
• Room for improvement
  – Steps was not tuned for TPC-C
  – Shore's code yields low concurrency
⇒ Steps minimizes both capacity & conflict misses without increasing I-cache size / associativity
Thank you