Posted on 16-Feb-2019

TRANSCRIPT
LATTE-CC: Latency Tolerance Aware Adaptive Cache Compression Management for Energy Efficient GPUs
Akhil Arunkumar, Shin-Ying Lee, Vignesh Soundararajan, Carole-Jean Wu
School of Computing, Informatics and Decision Systems Engineering
Arizona State University
24th IEEE International Symposium on High-Performance Computer Architecture
GPU Computing is Ubiquitous
• Accelerate parallel applications
  • Scientific simulations
  • Genomics
  • Artificial intelligence
1/21

Motivation: Data Cache is Inefficiently Used
[Diagram: a GPU with streaming multiprocessors SM-1 … SM-N, each running 1000s of concurrent threads (Warp-0 … Warp-N) over a shared L1 cache]
• GPU L1 caches
  • 16 KB – 128 KB
  • 1000s of concurrent threads
  • 10s of bytes / thread
  • Severe cache thrashing
• Ample data locality is available
  • 2x cache → 50% speedup
  • Not exploited due to thrashing
Need to utilize data cache capacity better
2/21
Prior Work
• GPU cache bypassing
  • MRPB [HPCA'14]
  • Adaptive bypassing [GPGPU'15]
  • PCAL [HPCA'15]
  • Ctrl-C [ICCD'16]
  • ID-Cache [IISWC'16]
  • Others
• Warp scheduling
  • 2-Level [ISCA'11, MICRO'11]
  • CCWS [MICRO'12]
  • DAWS [MICRO'13]
  • CAWA [ISCA'15]
  • Others
These techniques either reduce TLP or find it hard to recover from inaccurate decisions.
3/21
Can Data Compression be Applied to GPU Caches?
• Data compression has been applied to:
  • CPUs – DRAM, interconnect, and last-level caches
  • GPUs – interconnect and register files
• Cache compression
  (+) Increased effective cache capacity
  (−) Decompression latency is on the critical path
• GPUs are known to be latency tolerant
[Diagram: on a hit, data flows from the compressed cache through the decompressor to the requestor, adding decompression latency]
Can we exploit GPU latency tolerance for data cache compression?
4/21
Outline
• Introduction and Background
• Motivation for L1 cache compression in GPUs
• LATTE-CC: Latency Tolerance Aware Cache Compression Management
• Methodology and Evaluation
• Conclusion
5/21
Motivation for L1 Cache Compression
[Chart: speedup over baseline (0.6x – 2.4x) for BFS, KM, PF, SS, MM, BC, MIS, CLR, FW, PRK, DJK, and Avg, comparing BDI[1] and SC[2] with and without decompression latency; some bars reach 3x, one drops to 0.48; workloads split into compression-friendly and SC-friendly groups]
• GPU workloads are compression friendly (2x – 3.6x compression ratio)
• Some workloads show affinity to particular compression algorithms
• Static application of compression results in unpredictability and lost potential
[1] Pekhimenko et al., "Base-delta-immediate compression: Practical data compression for on-chip caches," in PACT 2012
[2] Arelakis and Stenstrom, "SC2: A statistical compression cache scheme," in ISCA 2014
6/21
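The base+delta idea behind BDI[1] can be illustrated with a short sketch. This is my own simplification with a single base and one delta width; real BDI hardware tries several base sizes, delta widths, and immediate values per line:

```python
def bdi_compress(words, delta_bytes=1):
    """Try to represent a cache line of 32-bit words as one base value
    plus small signed per-word deltas (the core idea of BDI)."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)           # signed delta range
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return {"base": base, "deltas": deltas, "delta_bytes": delta_bytes}
    return None                                   # line stays uncompressed

def bdi_decompress(line):
    return [line["base"] + d for d in line["deltas"]]

# A line of nearby pointers compresses well:
line = [0x10000, 0x10004, 0x10008, 0x1000C]
c = bdi_compress(line)
assert c is not None and bdi_decompress(c) == line
# Compressed size: 4-byte base + 4 one-byte deltas = 8 bytes vs. 16 bytes raw.
```

Pointer- and counter-heavy GPU data has exactly this low-dynamic-range structure, which is why the measured compression ratios above are so high.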
GPU Latency Tolerating Ability
• GPU applications possess different extents of latency tolerance
• Available latency tolerance varies over application execution phases
[Charts: % degradation in performance (0–40%) when +2 cycles (BDI) or +14 cycles (SC) of decompression latency are added, across PRK, MIS, CLR, DJK, BC, and FW, grouped by high, moderate, and low latency tolerance; and the latency tolerance of the SS application (0–25 cycles) varying over time]
7/21
Outline
• Introduction and Background
• Motivation for L1 cache compression in GPUs
• LATTE-CC: Latency Tolerance Aware Cache Compression Management
• Methodology and Evaluation
• Conclusion
8/21
Background on Cache Compression
[Diagram: Streaming Multiprocessor (SM-1) with a warp pool (Warp 0 … Warp (N-1)), warp scheduler, and SIMD lanes; BDI & SC compressors sit between the L2 cache and the L1 data cache, and BDI & SC decompressors between the L1 data cache and the SM; each tag stores Tag;Compression_Encoding]
• On miss
  • Data from the L2 cache is given to the compressor
  • Compressed data is placed in the L1 data cache
  • The number of tags available in the cache is increased (e.g., a sector-like cache with 4x tags)
• On hit
  • Compressed data is given to the decompressor
  • Decompressed data is sent to the SM
9/21
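The miss/hit flow described above can be sketched in a few lines. This is a hypothetical software model, not the hardware design: a dict stands in for the 4x tag array, and `zlib` stands in for the BDI/SC compressors:

```python
import zlib  # stand-in compressor; the hardware uses BDI or SC


class CompressedL1:
    """Toy model of a compressed L1 data cache: compress on fill,
    decompress on hit."""

    def __init__(self):
        self.lines = {}  # tag -> (compressed blob, encoding)

    def access(self, addr, fetch_from_l2):
        if addr in self.lines:                     # hit
            blob, _encoding = self.lines[addr]
            return zlib.decompress(blob)           # decompression latency sits here
        data = fetch_from_l2(addr)                 # miss: fetch from L2
        self.lines[addr] = (zlib.compress(data), "zlib")
        return data


cache = CompressedL1()
l2 = lambda addr: bytes(128)                       # 128-byte line of zeros
assert cache.access(0x40, l2) == bytes(128)        # miss path: compress and fill
assert cache.access(0x40, l2) == bytes(128)        # hit path: decompress
```

The key point the model captures is that decompression is on the hit path, which is exactly why the rest of the talk is about hiding that latency.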
LATTE-CC: Latency Tolerance Aware Cache Compression
[Diagram: the same SM and compressed L1 data cache, extended with a compression mode prediction unit (Default/BDI/SC) fed by a latency tolerance estimation unit and a capacity benefit estimation unit; each tag now stores Tag;Compression_policy;Compression_Encoding]
• Compression mode prediction
  • BDI, SC, and Default (no compression)
• Latency tolerance estimation
• Capacity benefit estimation
10/21
Compression Mode Selection
• Combine latency tolerance, capacity benefit, and decompression latency into one metric → AMAT
  • Minimize the average memory access time (AMAT)
• Accommodate application phases
  • Divide execution time into multiple experimental phases (EPs)
  • Estimate the AMAT of the 3 compression modes at each EP
  • Choose the compression mode with the lowest AMAT
[Timeline: execution time divided into EP-1 … EP-m, EP-m+1 … EP-N]
11/21
AMAT Estimation
AMAT is defined as

  AMAT = (total hit latency + total miss latency) / (N_hits + N_misses)

In the context of the compression modes,

  AMAT_mode = fn(GPU latency tolerance,
                 N_hits-mode, N_misses-mode,
                 hit latency, miss latency,
                 decompression latency_mode,
                 decompressor queuing delay_mode)

12/21
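The slide leaves `fn` abstract; one plausible instantiation, charging each hit only the decompression delay that the SM's latency tolerance cannot hide, looks like this (counts, latencies, and the cost model itself are illustrative, not the paper's exact formulation):

```python
def amat(mode, stats, latency_tolerance, hit_lat=1, miss_lat=200):
    """Estimate AMAT for one compression mode.

    Decompression latency plus queuing delay beyond the available
    latency tolerance is exposed and added to every hit.
    """
    n_hit, n_miss, decomp_lat, queue_delay = stats[mode]
    exposed = max(0, decomp_lat + queue_delay - latency_tolerance)
    total = n_hit * (hit_lat + exposed) + n_miss * miss_lat
    return total / (n_hit + n_miss)


# Per-mode (N_hits, N_misses, decompression latency, queuing delay);
# made-up numbers for one EP. SC hits more often but decompresses slowly.
stats = {
    "default": (60, 40, 0, 0),
    "bdi":     (75, 25, 2, 1),
    "sc":      (90, 10, 14, 4),
}
best = min(stats, key=lambda m: amat(m, stats, latency_tolerance=10))
```

With ample latency tolerance the exposed decompression cost shrinks, so the higher-capacity mode wins; with little tolerance the same numbers can flip toward BDI or Default, which is exactly the adaptivity LATTE-CC exploits.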
Latency Tolerance Estimation
• We leverage the warp pool to estimate GPU latency tolerance
[Diagram: Streaming Multiprocessor with a warp pool (Warp 0 … Warp (N-1)), warp scheduler, and SIMD lanes]
• The GPU hides latency by swapping a stalled warp with a ready warp for execution
• Number of available ready warps → degree of latency tolerance

  Latency tolerance = (ready warps available) × (instructions executed per warp)

• Latency tolerance is estimated dynamically for every EP
[Timeline: execution time divided into EP-1 … EP-m, EP-m+1 … EP-N]
13/21
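The formula above is simple enough to state directly in code. A minimal sketch, assuming one instruction issues per cycle so instruction counts translate one-for-one into hidden cycles:

```python
def latency_tolerance(warp_states, insts_per_warp_this_ep):
    """Slide formula: latency tolerance = ready warps * insts per warp.

    Each ready warp can keep the SM issuing for roughly the cycles its
    remaining instructions take, hiding that many cycles of a stalled
    warp's decompression latency.
    """
    ready = sum(1 for state in warp_states if state == "ready")
    return ready * insts_per_warp_this_ep


# 3 of 5 warps are ready; each executes ~4 instructions this EP.
warps = ["ready", "stalled", "ready", "ready", "stalled"]
assert latency_tolerance(warps, insts_per_warp_this_ep=4) == 12
```

When most warps are stalled the estimate collapses toward zero, signaling that added decompression latency would be exposed and biasing mode selection toward faster (or no) compression.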
Capacity Benefit Estimation
• Use a few EPs periodically (a learning phase) to learn the cache capacity benefit
• Use a modified set-sampling[1] method to measure the hits and misses incurred by each compression mode
[Diagram: execution time divided into a learning phase (EP-1 … EP-m) and an adaptive phase (EP-m+1 … EP-N, EP-(N+1) …); dedicated sets run Default, BDI, and SC modes while follower sets run the chosen mode (BDI), collecting N_hit and N_miss counters per mode]
[1] Qureshi et al., "A case for MLP-aware cache replacement," in ISCA 2006
14/21
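The set-sampling bookkeeping above reduces to a few counters. A sketch under assumed set assignments (the dedicated-set indices and access trace are hypothetical):

```python
from collections import defaultdict

# A few dedicated cache sets each run one compression mode during the
# learning phase; their hit/miss counters stand in for each mode's
# capacity benefit across the whole cache. Follower sets are not sampled.
SAMPLED = {0: "default", 1: "bdi", 2: "sc"}   # set index -> dedicated mode

counters = defaultdict(lambda: {"hits": 0, "misses": 0})


def record(set_index, hit):
    mode = SAMPLED.get(set_index)
    if mode is not None:                       # ignore follower sets
        counters[mode]["hits" if hit else "misses"] += 1


# Replay a toy access trace: (set index, was it a hit?)
for set_index, hit in [(0, False), (1, True), (2, True), (2, True), (7, True)]:
    record(set_index, hit)
assert counters["sc"] == {"hits": 2, "misses": 0}
```

At the end of the learning phase these per-mode N_hit/N_miss counts feed directly into the AMAT estimate for the adaptive phase.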
Putting It All Together
[Timeline: execution time alternates between learning phases and adaptive phases across EP-1 … EP-N, EP-(N+1) …, with dedicated Default/BDI/SC mode sets and follower sets]
• Learning phase:
  • Cache capacity benefit estimation
  • Uses a few EPs periodically
• Adaptive phase:
  • Estimate latency tolerance for each EP
  • Estimate AMAT_mode for each EP and choose the best compression mode
15/21
Outline
• Introduction and Background
• Motivation for L1 cache compression in GPUs
• LATTE-CC: Latency Tolerance Aware Cache Compression Management
• Methodology and Evaluation
• Conclusion
16/21
Methodology
• GPU parameters
  • 15 SMs
  • 16 kB L1D cache
  • 768 kB L2 cache
  • GTO warp scheduler
• LATTE-CC cache parameters
  • 4x tags
  • EP – 256 accesses
  • Compression / decompression latency
    • BDI – 2 / 2 cycles
    • SC – 6 / 14 cycles
• GPUWattch
  • Compression / decompression energy
    • BDI – 0.192 / 0.056 nJ
    • SC – 0.42 / 0.336 nJ
• Benchmarks
  • 22 benchmarks from Rodinia[1], Mars[2], Pannotia[3], and CUDA SDK[4]
  • 11 cache sensitive (C-Sens) and 11 cache insensitive (C-InSens)
[1] Che et al., "Rodinia: A benchmark suite for heterogeneous computing," in IISWC 2009
[2] He et al., "Mars: A MapReduce framework on graphics processors," in PACT 2008
[3] Che et al., "Pannotia: Understanding irregular GPGPU graph applications," in IISWC 2013
[4] NVIDIA, "CUDA C/C++ SDK code samples"
17/21
LATTE-CC: Performance
[Charts: speedup over baseline (0.6x – 1.6x) for BDI, SC, and LATTE-CC across the C-Sens benchmarks (BFS, KM, PF, SS, MM, BC, MIS, CLR, FW, PRK, DJK); and average speedup (0.8x – 1.3x) for C-Sens, C-InSens, and overall]
• Static application of compression leads to performance variability
• Fine-grain adaptation by LATTE-CC results in high performance improvement
18/21
LATTE-CC: Benefit of Latency Tolerance
[Charts: MPKI reduction over baseline (0–30%) and speedup over baseline (1.0x – 1.2x) for LATTE-CC, Adaptive-Hit-Count[1], and Adaptive-CMP[1]]
• Prioritizing hit counts leads to sub-optimal performance
• Not considering latency tolerance leads to sub-optimal performance
[1] Alameldeen and Wood, "Adaptive cache compression for high-performance processors," in ISCA 2004
19/21
LATTE-CC: Sources of Energy Reduction
[Chart: LATTE-CC energy reduction compared to baseline (−5% to 20%) across the C-Sens benchmarks and their average, broken down into data movement, static, L2 cache, DRAM, other, and compression energy]
• LATTE-CC reduces energy consumption by 10%
• Data movement and static energy are significantly reduced
  • 4.2% reduction due to data movement energy
  • 3.7% reduction due to static energy
20/21
Conclusion
• This is the first work to explore cache compression for GPUs
  • GPU workloads are compression friendly (2x – 3.6x compression ratio)
  • Decompression latency is hidden to varying extents
  • To maximize compression benefit, an adaptive compression management system is needed
• We proposed LATTE-CC to:
  • Exploit GPU latency tolerance
  • Perform efficient cache compression on GPU L1 caches
• LATTE-CC achieves:
  • 20% performance improvement
  • 10% energy reduction
21/21
Thank you