
Page 1: Memory Hierarchy for Web Search - Stanford University, csl.stanford.edu/~christos/publications/2018.search.hpca..., 2018-07-03

Memory Hierarchy for Web Search

Grant Ayers*°, Jung Ho Ahn†°, Christos Kozyrakis*, Partha Ranganathan‡

Stanford University*, Seoul National University†

Google‡

°Work performed while authors were at Google


2

The world is headed toward cloud-based services

...and we’re still optimizing for SPEC


Research Objective

Design tomorrow’s CPU architectures for OLDI workloads like web search

1. Provide the first public in-depth study of the microarchitecture and memory system behavior of commercial web search

2. Propose new performance optimizations with a focus on the memory hierarchy

Results show 27% performance improvement, and 38% with future devices.

3


Understanding Web Search on Current Architectures

4


Google’s web search is scalable

Scalability / Hardware Optimizations

● Linear core scaling
● Not bandwidth- or I/O-bound
● SMT (+37%), huge pages (+11%), hardware prefetching (+5%)

● Architects can assume excellent software scaling

5


Google web search performance on Intel Haswell

6

[Figure: web search leaf-node CPU utilization, with stalls highlighted]


Memory Hierarchy Characterization

7


Challenges and Methodology

Challenges

1. No known timing simulator can run search for a non-trivial amount of virtual time
2. Performance counters are limited and often broken

Methodology

● Measurements from real machines
● Trace-driven functional cache simulation (Intel Pin, 135 billion instructions)
● Analytical performance modeling

8
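The trace-driven functional simulation above can be sketched in a few lines. This is a minimal illustration, not the actual Pin-based tooling: the cache geometry, the LRU policy, and the address-list trace format are all assumptions made for the example.

```python
from collections import OrderedDict

class SetAssocCache:
    """Functional (timing-free) set-associative cache with LRU replacement."""
    def __init__(self, size_bytes, assoc, line_bytes=64):
        self.assoc = assoc
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (assoc * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss; updates LRU state."""
        line = addr // self.line_bytes
        idx, tag = line % self.num_sets, line // self.num_sets
        s = self.sets[idx]
        if tag in s:
            s.move_to_end(tag)       # refresh LRU position
            return True
        if len(s) >= self.assoc:
            s.popitem(last=False)    # evict the least recently used tag
        s[tag] = None
        return False

def miss_ratio(trace, cache):
    """trace: iterable of byte addresses from an instrumented run."""
    misses = total = 0
    for addr in trace:
        total += 1
        misses += 0 if cache.access(addr) else 1
    return misses / max(total, 1)

print(miss_ratio([0, 0, 64, 0], SetAssocCache(1024, assoc=2)))   # 0.5
```

A real Pin tool would stream billions of addresses through `access`; here a four-access trace suffices to exercise hits, misses, and eviction.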


Working set scaling

Memory accessed in steady-state

● Shard footprint is constant, but the touched footprint grows with cores and time (little data locality in the shard)

● The heap working set converges around 1 GiB, which suggests sharing and cold structures

9


Overall cache effectiveness

● L1 and L2 caches experience significant misses of all types

● L3 cache virtually eliminates code misses but is insufficient for heap and shard

What’s the ideal L3 size?

10



L3 cache scaling

● 16 MiB sufficiently removes code misses

● Not even 2 GiB captures the shard

● 1 GiB would capture the heap

Large shared caches are highly effective for heap accesses.

The L3 cache is in a region of diminishing returns

15

[Figures: L3 hit rate and L3 MPKI vs. L3 capacity. Annotations: "16 MiB sufficient for instructions", "1 GiB sufficient for heap", "region of diminishing returns"]


Memory Hierarchy for Hyper-scale SoCs

16


Optimization strategy

Analysis indicates diminishing returns for the L3 cache but untapped potential for much larger caches, which suggests two contrasting optimizations:

1. Repurpose expensive on-chip transistors in the L3 cache for cores
2. Exploit the locality in the heap with cheaper, higher-capacity DRAM incorporated into a latency-optimized L4 cache

17


Cache vs. Cores Trade-off

Intel Haswell¹

● 18 cores
● 2.5 MiB L3 per core
● Core area cost is 4 MiB of L3

18

¹ “The Xeon Processor E5-2600 v3: A 22nm 18-core product family” (ISSCC ’15)


Trading cache for cores

Sweep core count and L3 capacity in terms of chip area used:

● Each core costs the area of 4 MiB of L3
● Use CAT to vary the L3 from 4.5 to 45 MiB

Some L3 transistors could be better used for cores (9 cores with 2.5 MiB/core is worse than 11 cores with 1.23 MiB/core)

Core count is not all that matters! (All 18-core configurations with < 1 MiB/core are bad)

19


Trading cache for cores

What’s the right cache-per-core balance?

Incorporate the sweep data into a linear model

● Performance is linear with respect to core count

● We have two measurements for each cache ratio

1 MiB/core of L3 cache allows 5 extra cores and 14% performance improvement

20

Cache-for-Cores Performance
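The sweep-plus-linear-model idea can be sketched as follows. The area budget uses slide 18's rule of thumb (one core occupies the area of 4 MiB of L3); the per-core throughput curve is a made-up diminishing-returns placeholder standing in for the measured CAT-sweep data, so only the shape of the trade-off, not the numbers, should be read into it.

```python
# Iso-area trade-off sketch: each core costs the die area of ~4 MiB of L3
# (slide 18), so a fixed budget trades cores against cache capacity.
AREA_BUDGET = 18 * 4 + 45   # Haswell baseline: 18 cores + 45 MiB L3

def placeholder_throughput(mib_per_core):
    # Illustrative diminishing-returns curve, NOT the measured sweep data.
    return 1.0 - 0.25 / (1.0 + 4.0 * mib_per_core)

def perf(cores, l3_mib):
    # Linear in core count, scaled by per-core throughput at this ratio.
    return cores * placeholder_throughput(l3_mib / cores)

# Enumerate iso-area configurations with at least 1 MiB of L3 per core.
configs = [(c, AREA_BUDGET - 4 * c) for c in range(10, 26)
           if AREA_BUDGET - 4 * c >= c]
best = max(configs, key=lambda cfg: perf(*cfg))
print(best)   # (23, 25): more cores at ~1 MiB/core, under these assumptions
```

Under this placeholder curve the sweep lands on a 23-core, roughly 1 MiB-per-core design, echoing the slide's conclusion that a modest cache cut buys several extra cores.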


Latency-optimized L4 cache

Target the available locality in the fixed 1 GiB heap

● Not feasible with an on-chip SRAM cache
● Need an off-chip, on-package eDRAM cache
  ○ eDRAM provides lower latency
  ○ A multi-chip package allows for existing 128 MiB dies
● Less than 1% die area overhead
● Use an existing high-bandwidth interface such as Intel’s OPIO

21


Latency-optimized L4 cache

● 1 GiB on-package eDRAM
● 40-60 ns hit latency
● Based on the Alloy cache
● Parallel lookups with memory
● Direct-mapped
● No coherence

22

Proposed L4 Cache based on eDRAM
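The latency benefit of probing the L4 in parallel with memory, as in the Alloy-style design above, can be sketched with simple expected-latency math. The 50 ns hit latency sits in the slide's 40-60 ns range; the 90 ns DRAM latency and the full-overlap miss model are assumptions made for illustration.

```python
# Expected-latency sketch for an Alloy-style direct-mapped L4 whose probe
# overlaps with the DRAM access. Latency constants are illustrative.
L4_HIT_NS = 50   # within the slide's 40-60 ns range
DRAM_NS = 90     # assumed main-memory latency, not from the slide

def effective_latency(l4_hit_rate, parallel=True):
    if parallel:
        # The DRAM fetch was launched alongside the probe, so a miss
        # costs max(probe, DRAM) rather than their sum.
        miss_ns = max(L4_HIT_NS, DRAM_NS)
    else:
        miss_ns = L4_HIT_NS + DRAM_NS   # serialized probe-then-fetch
    return l4_hit_rate * L4_HIT_NS + (1 - l4_hit_rate) * miss_ns

print(effective_latency(0.75))                  # 60.0 ns
print(effective_latency(0.75, parallel=False))  # 72.5 ns
```

The parallel lookup makes the direct-mapped design forgiving: even the misses cost no more than a plain DRAM access.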


L4 cache miss profile

Baseline is the optimized 23-core design with 1 MiB of L3 cache per core (iso-area to the 18-core Haswell)

23

[Figures: L4 hit rate and L4 MPKI vs. L4 capacity]
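The iso-area claim for this baseline can be checked with slide 18's rule of thumb that one core occupies roughly the area of 4 MiB of L3:

```python
# Iso-area check, measuring die area in "MiB-of-L3 equivalents".
CORE_AREA = 4                       # one core ~ 4 MiB of L3 (slide 18)
baseline = 18 * CORE_AREA + 45      # 18 cores, 2.5 MiB L3 per core
proposed = 23 * CORE_AREA + 23      # 23 cores, 1 MiB L3 per core
print(baseline, proposed)           # 117 115, within ~2% of each other
```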



L4 cache + cache-for-cores performance

25

● 27% overall performance improvement

● 22% “pessimistic” (60 ns hit latency, 5 ns additional miss penalty)

● 38% “future” (+10% latency & misses)

L4 and Cache for Cores
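As a rough sanity check, and not the paper's own accounting, if the 14% cache-for-cores gain and the L4 gain composed multiplicatively, the L4's standalone contribution implied by the 27% total would be about 11%:

```python
# Assumes multiplicative composition of the two speedups, an
# illustrative simplification rather than the paper's derivation.
total_speedup = 1.27    # L4 + cache-for-cores (slide 25)
cores_speedup = 1.14    # cache-for-cores alone (slide 20)
implied_l4 = total_speedup / cores_speedup
print(f"implied L4-only speedup: {implied_l4:.3f}")   # ~1.114
```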


Ongoing work

1. Shard memory misses
2. Instruction misses
3. Branch stalls and BTB misses
4. New system balance ratios

26

[Figure: web search leaf-node CPU utilization]


Conclusions

1. OLDI is an important class of applications about which little data is available

2. Web search is a canary application for OLDI that is inefficient on today’s hardware

3. Through a careful rebalancing of the memory hierarchy, we’re able to improve Google’s web search by 27% today, and 38% in the future

4. There is high potential for new SoCs specifically designed for OLDI workloads

27