TRANSCRIPT
STEPS Towards Cache-Resident Transaction Processing
Stavros Harizopoulos
joint work with Anastassia Ailamaki
Carnegie Mellon, VLDB 2004
Carnegie Mellon Databases
OLTP workloads on modern CPUs
• L1-I stalls account for 25-40% of execution time
• Instruction caches cannot grow
We need a solution for instruction cache-residency
[Figures: execution-time breakdown for server CPUs (computation vs. L1-I, L2-I, L2-D, and other stalls) with CPI of 2-6 over CPUs introduced 1996-2004; cache sizes over the same period (10KB-10MB, log scale), with max on-chip L2/L3 caches growing (L2: 256KB, 512KB, 1MB) while L1-I caches stay small]
Steps for cache-resident code
• Eliminate misses for a group of Xactions
  – Xactions are assigned to threads
  – Multiplex execution at fine granularity
  – Reuse instructions in L1-I cache
STEPS: Synchronized Transactions through Explicit Processor Scheduling
Fewer misses & misspred. branches
• Up to 1.4x speedup
• Eliminates 96% of L1-I misses for each add'l thread
• Eliminates 64% of mispredicted branches
[Figures: normalized counts (20-100%) of cycles, L1-I misses, branch mispredictions, and L1-D misses, Shore vs. Steps; number of L1-I misses (2K-8K) vs. concurrent threads (1-8). Workload: index selection in the Payment Xaction (TPC-C)]
Outline
• Background & related work
• Basic implementation of Steps
• Microbenchmarks
  – AthlonXP, SimFlex simulator
• Applying Steps to OLTP workloads
• TPC-C results
  – Shore on AthlonXP
Background
[Diagram: CPU with split L1-I and L1-D caches backed by a unified L2 cache; a loop (code blocks F1-F4) calling function B (block B1), with the blocks competing for the same sets of a 2-way set-associative L1-I cache]
– Caches trade size for lookup speed
– L1-I misses are expensive
– Both capacity misses and conflict misses occur (example: 2-way set-associative L1-I)
Background (cont.)
• A larger cache size or higher associativity would help, but each means slower access to the L1-I cache and a slower CPU clock
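To make the conflict-miss point concrete, here is a toy simulation (not from the talk; all cache parameters are made up for illustration) of a 2-way set-associative instruction cache. Three small code regions fit comfortably in the cache's total capacity, yet because they map to the same sets they keep evicting one another:

```python
from collections import deque

BLOCK = 64          # bytes per cache block
SETS = 64           # number of sets
WAYS = 2            # associativity -> 64 sets * 2 ways * 64B = 8KB cache

class ICache:
    """Tiny 2-way set-associative instruction cache with LRU per set."""
    def __init__(self):
        self.sets = [deque(maxlen=WAYS) for _ in range(SETS)]
        self.misses = 0

    def fetch(self, addr):
        blk = addr // BLOCK
        s, tag = blk % SETS, blk // SETS
        way = self.sets[s]
        if tag in way:
            way.remove(tag)        # hit: refresh LRU position
        else:
            self.misses += 1       # miss: fetch block (evicts LRU if full)
        way.append(tag)

cache = ICache()
# Three 2KB code regions (32 blocks each, 6KB total: less than the 8KB
# capacity) laid out 4KB apart, so they all map to sets 0-31.  Each of
# those sets sees three distinct tags but has only two ways, so looping
# over the regions produces a conflict miss on every single fetch.
regions = [r * SETS * BLOCK for r in range(3)]
for _ in range(10):
    for base in regions:
        for off in range(0, 32 * BLOCK, BLOCK):
            cache.fetch(base + off)
print(cache.misses)   # 3 regions x 32 blocks x 10 passes, all misses
```

Every one of the 960 fetches misses even though the working set is smaller than the cache; a fully associative cache of the same size would miss only on the first pass. This is exactly the trade-off the slide describes: higher associativity removes these misses, but at the cost of slower cache access.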
Related work
• Database & architecture papers:
  – DB workloads are increasingly non-I/O-bound
  – L2/L3 data misses, L1-I misses
  – ORACLE OLTP code working set: 560KB
• Hardware & compiler approaches:
  – Increase block size, add stream buffer [asplos98]
  – Call graph prefetching (for DSS) [tocs03]
  – Code layout optimizations [isca01] [..]
Related work: within the DBMS
• Data-cache misses (mostly DSS)
  – Cache-aware page layout, B-trees, join algorithms
  – Active area [..]
• Instruction-cache misses in DSS
  – Batch processing of tuples [icde01] [sigmod04]
• Instruction-cache misses in OLTP: challenging!
Outline
• Related work
• Basic implementation of Steps
• Microbenchmarks
• Applying Steps to OLTP workloads
• TPC-C results
Steps overview
• The DBMS assigns Xactions to threads
• Xactions consist of a few basic operators (Ops)
  – Index select, scan, update, insert, delete, commit
• Steps groups threads per Op
• Within each Op, reuse instructions
⇒ I-cache aware context-switching
I-cache aware context-switching
BEFORE: each thread runs select() (steps s1-s7) to completion; the CPU then performs a context switch (CTX) to the next thread. The code does not fit in the I-cache, so every thread misses on every step.
AFTER: CTX points are placed where the code just fills the I-cache. Thread 1 misses while loading s1-s3, then the remaining threads hit on the instructions already resident; the group then moves on to s4-s7 the same way.
[Diagram: instruction-cache contents over time, with the per-step miss (M) / hit (H) pattern for each thread before and after]
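The before/after schedules can be sketched with a toy simulation (illustrative only; block counts and cache size are made-up parameters, and the I-cache is modeled as fully associative LRU for simplicity). Running each thread to completion streams the whole Op through the cache once per thread, while switching every cache-sized chunk lets the whole group hit on resident code:

```python
from collections import OrderedDict

CACHE_BLOCKS = 4    # toy I-cache capacity, in code blocks
OP_BLOCKS = 10      # select() spans 10 blocks: it does not fit
N_THREADS = 8

def run(schedule):
    """Replay (thread, code-block) fetches against a fully associative
    LRU I-cache model and count the misses."""
    cache, misses = OrderedDict(), 0
    for _, blk in schedule:
        if blk in cache:
            cache.move_to_end(blk)          # hit: refresh LRU position
        else:
            misses += 1
            cache[blk] = True
            if len(cache) > CACHE_BLOCKS:
                cache.popitem(last=False)   # evict least recently used
    return misses

# BEFORE: each thread runs select() to completion before switching, so
# every thread streams all 10 blocks through the 4-block cache.
before = [(t, b) for t in range(N_THREADS) for b in range(OP_BLOCKS)]

# AFTER (Steps): context-switch every CACHE_BLOCKS blocks, so the whole
# team reuses the chunk of code currently resident in the cache.
after = [(t, b)
         for start in range(0, OP_BLOCKS, CACHE_BLOCKS)
         for t in range(N_THREADS)
         for b in range(start, min(start + CACHE_BLOCKS, OP_BLOCKS))]

print(run(before), run(after))   # 80 misses vs. 10 misses
```

In the sketch the naive schedule misses on all 80 fetches, while the Steps-style schedule pays only the 10 misses needed to load select() once, independent of the number of threads: each additional thread adds zero misses, matching the slide's "eliminate misses for each add'l thread" claim in spirit.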
Basic implementation on Shore
• Assume (for now):
  – Threads interested in same Op
  – Uninterrupted flow (no locks, I/O)
• Fast, small, compatible CTX code
  – 76 bytes; bypasses (for now) the full CTX
• Add CTX calls throughout the Op source code
  – Use hardware counters (PAPI) on a sample Op
Outline
• Related work
• Basic implementation of Steps
• Microbenchmarks
• Applying Steps to OLTP workloads
• TPC-C results
Microbenchmark setup
• All experiments on index fetch, in-memory index
  – 45KB footprint
• Fast CTX for both Steps/Shore, warm cache

                        AMD AthlonXP    Simulated (SimFlex, IA-32)
  L1 I + D cache size   64KB + 64KB     vary all cache parameters
  associativity         2-way
  block size            64 bytes
  L2 cache size         256KB
L1-I cache misses (AthlonXP)
• Steps eliminates 92-96% of misses for add'l threads
• All misses are conflict misses (cache is 64KB)
[Figure: L1-I cache misses (1K-4K) vs. concurrent threads (1-10), Shore vs. Steps]
L1-I misses & speedup (AthlonXP)
• Steps achieves max performance for 6-10 threads
• No need for larger thread groups
[Figure: L1-I miss reduction % (40-100%, with upper limit) and speedup (1.1x-1.4x) vs. concurrent threads (10-80)]
Smaller L1-I cache (AthlonXP and PIII, 10 threads)
• Steps outperforms Shore even on smaller caches (PIII)
• 62-64% fewer mispredicted branches on both CPUs
[Figure: normalized counts (20-120%) of cycles, L1-I misses, branch mispredictions, L1-D misses, branches, branches missing the BTB, and instruction stalls (cycles), for AthlonXP and Pentium III; one PIII bar reaches 209%]
SimFlex: L1-I misses (64-byte cache block, 10 threads)
• Steps eliminates all capacity misses (16KB, 32KB caches)
• Up to 89% overall miss reduction (upper limit is 90%)
[Figure: L1-I cache misses (2K-10K) vs. associativity (direct, 2-way, 4-way, 8-way, full) for Shore and Steps at 16KB, 32KB, and 64KB cache sizes, each with a MIN line; AthlonXP configuration marked]
Outline
• Related work
• Basic implementation of Steps
• Microbenchmarks
• Applying Steps to OLTP workloads
• TPC-C results
Design goals
• High concurrency on similar Ops
  – Cover full spectrum of Ops
• Correctness & low overhead for:
  – Locks, latches, mutexes
  – Disk I/O
  – Exceptions (abort & roll back)
  – Housekeeping (deadlock detection, buffer pool)
Overview
1. Thin wrappers per Op to sync Xactions
   – Form execution teams per Op
   – Flexible definition of Op
2. Best effort within execution teams
   – Fast CTX through fixed scheduling
   – Threads leave team on exceptions
3. Repair thread structures at exceptions
   – Modify only the thread package
System design
• Threads go astray on exceptions
• Regroup at next Op
• Can have execution teams per database table
[Diagram: Steps wrappers around Ops X, Y, and Z, each with an execution team of threads; a stray thread leaves one team and moves to another Op]
Outline
• Related work
• Basic implementation of Steps
• Microbenchmarks
• Applying Steps to OLTP workloads
• TPC-C results
Experimentation setup
• Shore/Steps: AthlonXP, 2GB RAM, 2 disks
• Shore locking
  – Hierarchy: record, page, table, DB
  – Protocol: two-phase
• TPC-C: wholesale parts supplier
  – 10-30 warehouses, 100-300 users
• Increased concurrency through:
  – Zero-think-time TPC-C workload
  – In-memory database, lazy commits
One Xaction: Payment
Steps outperforms Shore:
• 1.4x speedup, 65% fewer L1-I misses
• 48% fewer mispredicted branches
• For 10 warehouses: 15 ready threads, 7 threads per team
[Figure: normalized counts (20-100%) of cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, and branch mispredictions for 10, 20, and 30 warehouses]
Mix of four Xactions
• Xaction mix reduces average team size (4.3 at 10 warehouses)
• Still, Steps has 56% fewer L1-I misses (out of a 77% max)
[Figure: normalized counts (20-100%) of cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, and branch mispredictions for 10 and 20 warehouses; two bars exceed 100% (121%, 125%)]
Summary of results
• Steps can handle full OLTP workloads
• Significant improvements in TPC-C
  – 65% fewer L1-I misses
  – 48% fewer mispredicted branches
• Room for improvement
  – Steps was not tuned for TPC-C
  – Shore's code yields low concurrency
⇒ Steps minimizes both capacity & conflict misses without increasing I-cache size / associativity
Thank you