
STREX: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution

Islam Atta, Pınar Tözün*, Xin Tong, Anastasia Ailamaki*, Andreas Moshovos

Icing analogy: three cupcake designs (Shark #1, Shark #2, Starfish).

Chocolate Base: http://www.marthastewart.com/337010/chocolate-cupcakes
Vanilla Base: http://www.marthastewart.com/256334/vanilla-cupcakes
Swiss Meringue Buttercream: http://www.marthastewart.com/318727/swiss-meringue-buttercream-for-cupcakes


Had only one icing bag.

[Figure: icing Shark #1, Shark #2, and Starfish cupcakes over time; every switch between designs requires Empty, Wash, Fill.]

© Islam Atta 8

Sssshhh…

[Figure: batched icing schedule; all Shark #1 cupcakes are iced first, then Shark #2, then Starfish, so Empty/Wash/Fill happens only once per design.]

When executing OLTP transactions, processors aren't as clever.

Transaction → DB Query → DB operations → Instruction Cache → Processor

Icing Cakes and OLTP Transactions

Today's systems interleave Transaction #1, Transaction #2, and Transaction #3, causing instruction misses. There is a better way.

Unlike Icing Cakes…

Transaction operations have unclear boundaries, and are repeated, conditional, and different.

STREX: a Dynamic Hardware Solution

• Breaks execution into L1-I-sized sub-problems
• Time-multiplexes threads to improve locality

Performance:

• Reduces instruction misses by up to 44%
• Reduces data misses by up to 37%
• Improves throughput by 35-55% for 2-16 cores

Robust:

• Non-OLTP workloads remain unaffected

Roadmap

• OLTP
• Characteristics
• Challenges
• Opportunities
• STREX
• SLICC and its limitations
• Results
• Summary

Online Transaction Processing (OLTP)

A $100 Billion/yr market, growing 10% annually
• E.g., banking, online purchases, stock market…

Benchmarking:
• Transaction Processing Performance Council (TPC)
• TPC-C: wholesale retailer
• TPC-E: brokerage market

OLTP drives innovation for HW and DB vendors

Transactions Suffer from Instruction Misses

Many concurrent transactions; each transaction's instruction footprint exceeds the L1-I size.

[Figure: per-transaction footprint vs. time, each footprint larger than the L1-I.]

Instruction stalls are due to L1 instruction cache thrashing.

OLTP Facts

Many concurrent transactions

Few DB operations: R(), U(), I(), D(), IT(), ITP()
• Each 28-65KB

Few transaction types (e.g., Payment, New Order)
• TPC-C: 5, TPC-E: 12

Transactions fit in 128-512KB

Code overlaps within and across different transactions

CMPs' aggregate L1-I cache is large enough

Temporal Code Redundancy

[Chart: percentage of L1-I cache contents vs. K-instructions (0-190) for the Payment transaction, broken down by reuse count: 1, <5, <10, >=10.]

Transactions perform similar operations in a similar sequence over time.

Why Is There So Much Instruction Overlap?

Payment: IT(CUST), R(DIST), R(CUST), U(CUST), U(DIST), U(WH), I(HIST), R(WH)

New Order: R(DIST), I(NORD), R(WH), U(DIST), R(CUST), R(ITEM), R(STO), U(STO), I(OL), I(ORD), with a Loop (OL_CNT) and a Condition in the control flow

Transactions are built using few DB operations.

Similar transactions perform similar operations.

Challenges

[Figure recap: today's systems interleave Transactions #1-#3 and incur instruction misses; stratified execution batches them.]

Transaction operations have unclear boundaries and are repeated, conditional, and different.

Generalized transaction scheduling is NP-Complete, so a heuristic is needed.


“When you cannot solve a problem… think of a problem you can solve”

Pikos Apikos, MCMLXXXV

Scheduling Identical Transactions

Each identical transaction executes code segments A, B, C in order, and each segment roughly fills the L1-I.

Conventional: A B C A B C A B C (each transaction runs to completion, so every segment fetch pays the full miss overhead)

STREX: A A A B B B C C C (Phase 1 runs segment A for every transaction, then Phase 2 runs B, then Phase 3 runs C; each segment is fetched once per phase and reused)
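The conventional vs. stratified schedules above can be demonstrated with a toy cache model (a minimal sketch with made-up segment sizes and cache capacity, not the paper's simulator):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny fully-associative LRU cache that only counts misses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.misses = 0

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)       # hit: refresh LRU position
        else:
            self.misses += 1                     # miss: fetch the block
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict the LRU block
            self.blocks[block] = True

# Three identical transactions, each executing code segments A, B, C.
# Each segment spans 8 cache blocks; the cache holds only 8 blocks total.
SEGMENTS = {s: [f"{s}{i}" for i in range(8)] for s in "ABC"}

def run(schedule):
    cache = LRUCache(capacity=8)
    for seg in schedule:
        for block in SEGMENTS[seg]:
            cache.access(block)
    return cache.misses

conventional = run("ABC" * 3)    # each transaction runs to completion
stratified = run("AAABBBCCC")    # STREX-style: one segment per phase

print(conventional)  # 72: every segment is re-fetched on every visit
print(stratified)    # 24: each segment is fetched once, then reused twice
```

Real transactions interleave far less cleanly than this, which is why a runtime mechanism is needed to find the phase boundaries.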

Optimal Scheduling for Identical Transactions

[Figure: Transactions A, B, and C time-multiplexed on one core's L1-I across Phases 1-3.]

Rule: do not evict a red block (a block touched during the current phase).

Implementation


[Figure: Transactions A, B, and C share one L1-I across Phases 1-3; the first thread acts as the Lead.]

1. Group same-type transactions.
2. The first thread becomes the Lead.
3. The phase # starts at ONE.
4. Touched blocks are marked with the current phase #.
5. If the victim block is tagged with the current phase #, switch to the next thread.
6. The Lead thread increments the phase #.

Works well for the general case.
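The six steps above can be sketched as a small software model (a simplified, hypothetical rendering with invented names; in STREX the tagging and thread switching happen in L1-I hardware):

```python
def strex_schedule(transactions, capacity):
    """Run same-type transactions (step 1), each a list of block addresses.
    Returns the interleaved execution order as (thread_id, block) pairs."""
    cursors = [0] * len(transactions)
    cache = {}                 # block -> phase # of last touch
    phase = 1                  # step 3: phase # starts at ONE
    order = []
    tid = 0                    # step 2: thread 0 is the Lead
    while any(c < len(t) for c, t in zip(cursors, transactions)):
        t, c = transactions[tid], cursors[tid]
        if c < len(t):
            block = t[c]
            if block not in cache and len(cache) >= capacity:
                victim = next(iter(cache))      # simple victim choice (no LRU)
                if cache[victim] == phase:
                    # step 5: victim was touched this phase -> switch thread
                    tid = (tid + 1) % len(transactions)
                    if tid == 0:
                        phase += 1   # step 6: Lead increments the phase #
                    continue
                del cache[victim]
            cache[block] = phase     # step 4: mark touched block with phase #
            order.append((tid, block))
            cursors[tid] += 1
        else:
            tid = (tid + 1) % len(transactions)
            if tid == 0:
                phase += 1
    return order

txns = [["a", "b", "c", "d"]] * 3   # three identical transactions
order = strex_schedule(txns, capacity=2)
print([blk for _, blk in order])    # blocks come out stratified: a,b for all
                                    # three threads, then c,d for all three
```

The protection rule in step 5 is what turns the cache size into a natural phase boundary: a thread runs until its footprint fills the cache, then the rest of the group reuses that footprint before anyone evicts it.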

Roadmap


• OLTP

• Characteristics

• Challenges

• Opportunities

• STREX

• SLICC and its limitations

• Results

• Summary

SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads

I. Atta, P. Tözün, A. Ailamaki, A. Moshovos

MICRO-45, December 2012.

SLICC Concept

Technology: the CMP's aggregate L1 instruction cache capacity is large enough.

Multiple L1-I caches, multiple threads: a thread migrates across cores over time, spreading its instruction footprint.

SLICC is similar to icing cakes with multiple icing bags.

Condition: the aggregate cache capacity must be sufficient. SLICC was demonstrated on 16 cores.

SLICC Needs Enough Cores

With few cores or a larger instruction footprint, the aggregate L1-I capacity no longer suffices.

Can these happen in practice?

1. Data center constraints limit core count.
2. Instruction footprints keep increasing.

Roadmap


• OLTP

• Characteristics

• Challenges

• Opportunities

• STREX

• SLICC and its limitations

• Results

• Summary

Methodology

Simulation:
• Zesto (x86) (thanks to GTech)
• 2-16 OoO cores, 32KB 8-way L1-I and L1-D, 1MB per-core L2
• QTrace (Xin Tong's QEMU extension)

Workloads: Shore-MT

Experimental Evaluation

Effect on INSTRUCTION and DATA misses? L1-I (instruction locality), L1-D (data sharing).

Performance impact: are CONTEXT SWITCHING OVERHEADS amortized?

Compared to SLICC: measure sensitivity to available CORE COUNT.

L1 Misses per Kilo Instructions (MPKI): Instructions

Baseline: no effort to reduce instruction misses.
SLICC: distributes the footprint across CMP cores/caches [Atta, MICRO'12].

[Chart: I-MPKI (0-45, lower is better) for Baseline, SLICC, and STREX at 2, 4, 8, and 16 cores, for TPC-C-10 and TPC-E.]
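MPKI in these charts is simply cache misses normalized per thousand executed instructions; for example (with made-up numbers, not the measured results):

```python
def mpki(misses, instructions):
    """L1 misses per kilo-instruction: misses / (instructions / 1000)."""
    return misses / (instructions / 1000)

# Hypothetical run: 4.2M L1-I misses over 100M executed instructions.
print(mpki(4_200_000, 100_000_000))  # 42.0
```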

L1 Misses per Kilo Instructions (MPKI): Data

[Chart: D-MPKI (0-35, lower is better) for Baseline, SLICC, and STREX at 2, 4, 8, and 16 cores, for TPC-C-10 and TPC-E.]

Throughput

[Chart: throughput relative to Base (higher is better) for Base, SLICC, STREX, and STREX+SLICC at 2, 4, 8, and 16 cores, for TPC-C-10 and TPC-E.]

STREX: a Dynamic Hardware Solution

• Breaks execution into L1-I-sized sub-problems
• Time-multiplexes threads to improve locality

Performance:

• Reduces instruction misses by up to 44%
• Reduces data misses by up to 37%
• Improves throughput by 35-55% for 2-16 cores

Robust:

• Non-OLTP workloads remain unaffected

Summary

OLTP performance suffers due to instruction stalls.

Application opportunity: temporal code redundancy.

SLICC: thread migration
• Sensitive to runtime core count

STREX: thread stratification
• Synchronizes transaction execution on a single core
• Improves L1 instruction (and data) locality

Hybrid: best of both worlds

Email: iatta@eecg.toronto.edu
Website: http://islamatta.com
Thanks!

Larger L1-I caches? [DaMoN'12]

[Chart: MPKI breakdown (conflict, capacity, compulsory) and speedup vs. cache size (16-512KB) for instructions and data, across TPC-C-10, TPC-E, and MapReduce.]

STREX with Identical Transactions

[Chart: I-MPKI (0-45) for Baseline vs. CTX-Identical, per transaction type: TPC-C (Delivery, New Order, Payment, Stock) and TPC-E (Broker, Customer, Market, Security, Tr_Stat, Tr_Upd, Tr_Look).]

Replacement Policies

[Chart: I-MPKI for TPC-C and TPC-E under LRU, LIP, BIP, SRRIP, BRRIP, STREX+LRU, STREX+BIP, and STREX+BRRIP.]

Thread Latency Trade-off

[Chart: latency distribution (frequency vs. M-Cycles) for Baseline (mean 6.37), STREX with 2-20 threads per group (e.g., STREX-2T 5.96, STREX-20T 29.68), and SLICC on 2-16 cores (SLICC-2 23.00, SLICC-16 7.49); plus relative throughput for TPC-C and TPC-E vs. group size 2-20.]

Detailed Methodology

Zesto (x86), QTrace (QEMU extension), Shore-MT

Workloads

Focus on OLTP:
• An important class of applications
• Instruction stalls dominate performance

Other workloads (similar to OLTP):
• Data Serving
• Media Streaming
• Web Frontend (SPECweb 2009)
• Web Backend

Hardware Cost

Hybrid