Supporting x86-64 Address Translation for 100s of GPU Lanes


Page 1: Supporting x86-64 Address Translation for 100s of GPU Lanes

UW-Madison Computer Sciences

Supporting x86-64 Address Translation for 100s of GPU Lanes

Jason Power, Mark D. Hill, David A. Wood

2/19/2014

Page 2: Supporting x86-64 Address Translation for 100s of GPU Lanes


Summary

• CPUs & GPUs: physically integrated, logically separate
– Near future: cache coherence, shared virtual address space

• Proof-of-concept GPU MMU design
– Per-CU TLBs, highly-threaded PTW, page walk cache
– Full x86-64 support
– Modest performance decrease (2% vs. ideal MMU)

• Design alternatives not chosen


Page 3: Supporting x86-64 Address Translation for 100s of GPU Lanes


Motivation

• Closer physical integration of GPGPUs

• Programming model still decoupled
– Separate address spaces
– Unified virtual address (NVIDIA) simplifies code
– Want: shared virtual address space (HSA hUMA)


Page 4: Supporting x86-64 Address Translation for 100s of GPU Lanes


[Figure: a CPU address space containing pointer-based structures such as a tree index and a hash-table index]

Page 5: Supporting x86-64 Address Translation for 100s of GPU Lanes


Separate Address Space


[Figure: separate CPU and GPU address spaces; data must be copied between spaces and pointers transformed to new addresses]

Page 6: Supporting x86-64 Address Translation for 100s of GPU Lanes


Unified Virtual Addressing


[Figure: unified virtual addressing, CPU and GPU address spaces with 1-to-1 addresses]

Page 7: Supporting x86-64 Address Translation for 100s of GPU Lanes


Unified Virtual Addressing

void main() {
    int *h_in, *h_out;
    h_in = cudaHostMalloc(sizeof(int)*1024);   // allocate input array on host
    h_in = ...                                 // initialize host array
    h_out = cudaHostMalloc(sizeof(int)*1024);  // allocate output array on host
    Kernel<<<1,1024>>>(h_in, h_out);
    ... h_out                                  // continue host computation with result
    cudaHostFree(h_in);                        // free memory on host
    cudaHostFree(h_out);
}


Page 8: Supporting x86-64 Address Translation for 100s of GPU Lanes


Shared Virtual Address Space


[Figure: shared virtual address space, CPU and GPU address spaces with 1-to-1 addresses]

Page 9: Supporting x86-64 Address Translation for 100s of GPU Lanes


Shared Virtual Address Space

• No caveats

• Programs “just work”

• Same as multicore model


Page 10: Supporting x86-64 Address Translation for 100s of GPU Lanes


Shared Virtual Address Space

• Simplifies code
• Enables rich pointer-based data structures
– Trees, linked lists, etc. (see the sketch below)
• Enables composability

Need: an MMU (memory management unit) for the GPU
• Low overhead
• Support for CPU page tables (x86-64)
• 4 KB pages
• Page faults, TLB flushes, TLB shootdowns, etc.
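To make the pointer-based case concrete, the following is a minimal sketch (not from the talk) of what a shared virtual address space permits: the kernel walks a linked list built by the CPU, following the CPU's pointers directly. The Node type, sum_list kernel, and build_list helper are hypothetical names; on hardware without shared virtual addressing this pattern only works for specially allocated (e.g. managed) memory.

// Sketch only: assumes the CPU and GPU share one virtual address space,
// so the kernel can dereference pointers created by ordinary CPU code.
struct Node { int value; Node *next; };

__global__ void sum_list(const Node *head, int *result) {
    int sum = 0;
    for (const Node *n = head; n != nullptr; n = n->next)
        sum += n->value;          // same virtual addresses as on the CPU
    *result = sum;                // single-thread traversal, for clarity
}

// Host side: build the list with ordinary code, then launch directly.
//   Node *head = build_list();            // hypothetical helper
//   sum_list<<<1, 1>>>(head, d_result);   // no copies, no pointer rewriting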


Page 11: Supporting x86-64 Address Translation for 100s of GPU Lanes


Outline

• Motivation

• Data-driven GPU MMU design
– D1: Post-coalescer MMU
– D2: + Highly-threaded page table walker
– D3: + Shared page walk cache

• Alternative designs

• Conclusions


Page 12: Supporting x86-64 Address Translation for 100s of GPU Lanes


System Overview


[Figure: system overview, a CPU core with L1 and L2 caches and a GPU with multiple compute units and an L2 cache, both sharing DRAM]

Page 13: Supporting x86-64 Address Translation for 100s of GPU Lanes


GPU Overview


[Figure: GPU overview, each compute unit contains lanes, a register file, instruction fetch/decode, a coalescer, shared memory, and an L1 cache; the compute units share the GPU L2 cache]

Page 14: Supporting x86-64 Address Translation for 100s of GPU Lanes


Methodology

• Target system
– Heterogeneous CPU-GPU system
– 16 CUs, 32 lanes each
– Linux OS 2.6.22.9

• Simulation
– gem5-gpu, full-system mode (gem5-gpu.cs.wisc.edu)

• Ideal MMU
– Infinite caches
– Minimal latency


Page 15: Supporting x86-64 Address Translation for 100s of GPU Lanes


Workloads

• Rodinia benchmarks
– backprop
– bfs: breadth-first search
– gaussian
– hotspot
– lud: LU decomposition
– nn: nearest neighbor
– nw: Needleman-Wunsch
– pathfinder
– srad: anisotropic diffusion

• Database sort
– 10-byte keys, 90-byte payloads

Page 16: Supporting x86-64 Address Translation for 100s of GPU Lanes


GPU MMU Design 0


[Figure: Design 0, per-lane MMUs: every lane in each compute unit has its own TLB, placed before the coalescer and L1 cache]

Page 17: Supporting x86-64 Address Translation for 100s of GPU Lanes


Design 1

Per-CU MMUs: Reducing the translation request rate


Page 18: Supporting x86-64 Address Translation for 100s of GPU Lanes


[Figure: 16 GPU lanes, each potentially issuing its own memory access]


Page 19: Supporting x86-64 Address Translation for 100s of GPU Lanes


[Figure: lanes plus shared memory (scratchpad); the memory request rate drops from 1x to 0.45x]


Page 20: Supporting x86-64 Address Translation for 100s of GPU Lanes


[Figure: lanes, shared memory (scratchpad), and coalescer; the request rate drops from 1x to 0.45x to 0.06x]


Page 21: Supporting x86-64 Address Translation for 100s of GPU Lanes


Reducing translation request rate

• Shared memory and the coalescer effectively filter global memory accesses (sketched below)
• Average: 39 TLB accesses per 1000 cycles

[Figure: breakdown of memory operations]
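As a rough illustration (not from the paper) of why post-coalescer translation sees so few requests, the toy host-side code below collapses a warp's 32 per-lane addresses into unique 128-byte line requests and unique 4 KB page translations; for a unit-stride access, the 32 lane addresses need only one translation. The sizes and counting logic are assumptions for the sketch.

// Toy model: the coalescer merges per-lane addresses into unique 128-byte
// requests; only those reach the post-coalescer TLB.
#include <stdint.h>
#include <stdio.h>

#define LANES 32

static int count_unique(const uint64_t *addrs, int n, int shift) {
    int unique = 0;
    for (int i = 0; i < n; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if ((addrs[i] >> shift) == (addrs[j] >> shift)) { seen = 1; break; }
        if (!seen) unique++;
    }
    return unique;
}

int main(void) {
    uint64_t addrs[LANES];
    for (int lane = 0; lane < LANES; lane++)
        addrs[lane] = 0x10000 + lane * 4;                 // unit-stride int loads
    printf("per-lane requests:        %d\n", LANES);
    printf("coalesced 128 B requests: %d\n", count_unique(addrs, LANES, 7));
    printf("4 KB page translations:   %d\n", count_unique(addrs, LANES, 12));
    return 0;
}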


Page 22: Supporting x86-64 Address Translation for 100s of GPU Lanes


Design 1


[Figure: Design 1, one post-coalescer TLB per CU between the coalescer and the L1 cache, backed by a shared page walk unit with a page fault register]

Page 23: Supporting x86-64 Address Translation for 100s of GPU Lanes


Performance


Page 24: Supporting x86-64 Address Translation for 100s of GPU Lanes


Design 2

Highly-threaded page table walker: increasing TLB miss bandwidth
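For reference, here is a sketch of the work each page walk performs: a four-level radix walk of the x86-64 page table. The read_phys helper (a physical-memory read of one 8-byte entry) is hypothetical, and large pages and permission bits are ignored; the highly-threaded walker simply keeps many such walks in flight at once instead of blocking on one.

#include <stdint.h>

// Hypothetical helper: read one 8-byte page-table entry from physical memory.
extern uint64_t read_phys(uint64_t paddr);

#define PTE_PRESENT 0x1ULL
#define PTE_ADDR    0x000FFFFFFFFFF000ULL   // bits 51:12 of an entry

// One x86-64 walk: CR3 -> PML4 -> PDPT -> PD -> PT -> physical address.
// Simplified: 4 KB pages only, no large pages, no fault handling.
int page_walk(uint64_t cr3, uint64_t vaddr, uint64_t *paddr) {
    uint64_t table = cr3 & PTE_ADDR;
    for (int level = 3; level >= 0; level--) {
        uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;   // 9 bits per level
        uint64_t pte = read_phys(table + index * 8);
        if (!(pte & PTE_PRESENT))
            return -1;                       // would raise a page fault
        table = pte & PTE_ADDR;
    }
    *paddr = table | (vaddr & 0xFFF);        // append the 12-bit page offset
    return 0;
}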


Page 25: Supporting x86-64 Address Translation for 100s of GPU Lanes


Outstanding TLB Misses

[Figure: a compute unit's lanes, coalescer, and shared memory generating outstanding TLB misses]


Page 26: Supporting x86-64 Address Translation for 100s of GPU Lanes


Outstanding TLB Misses

[Figure: the same compute unit with many TLB misses outstanding at once]


Page 27: Supporting x86-64 Address Translation for 100s of GPU Lanes


Multiple outstanding page walks

• Many workloads are bursty
• Miss latency skyrockets due to queueing delays with a blocking page walker

[Figure: concurrent page walks when non-zero (log scale)]


Page 28: Supporting x86-64 Address Translation for 100s of GPU Lanes


Design 2


[Figure: Design 2, Design 1 plus a highly-threaded page table walker with page walk buffers in the shared page walk unit]

Page 29: Supporting x86-64 Address Translation for 100s of GPU Lanes


Performance


Page 30: Supporting x86-64 Address Translation for 100s of GPU Lanes


Design 3

Page walk cache: overcoming the high TLB miss rate
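A page walk cache holds recently used upper-level entries of the walk so that most misses skip three of the four memory accesses sketched earlier. The direct-mapped structure, the cached_read helper, and the 1024-entry sizing (roughly 8 KB of entries, tags not counted) below are assumptions for illustration only.

#include <stdint.h>

// Reuses read_phys from the page-walk sketch above (hypothetical helper).
extern uint64_t read_phys(uint64_t paddr);

typedef struct { uint64_t pte_paddr; uint64_t pte; int valid; } PWCEntry;
#define PWC_ENTRIES 1024                      // ~8 KB of cached entries

static PWCEntry pwc[PWC_ENTRIES];

// Read a page-table entry through the page walk cache.
static uint64_t cached_read(uint64_t pte_paddr) {
    PWCEntry *e = &pwc[(pte_paddr >> 3) % PWC_ENTRIES];   // direct-mapped
    if (e->valid && e->pte_paddr == pte_paddr)
        return e->pte;                        // PWC hit: no memory access
    uint64_t pte = read_phys(pte_paddr);      // PWC miss: go to memory
    e->pte_paddr = pte_paddr; e->pte = pte; e->valid = 1;
    return pte;
}

// In the walk loop, levels 3..1 would call cached_read() instead of
// read_phys(); the leaf entry still goes to memory and fills the TLB.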


Page 31: Supporting x86-64 Address Translation for 100s of GPU Lanes


High L1 TLB miss rate

• Average miss rate: 29%!


[Figure: per-access rates with a 128-entry L1 TLB]

Page 32: Supporting x86-64 Address Translation for 100s of GPU Lanes


High TLB miss rate

• Requests from the coalescer are spread out
– Multiple warps issuing unique streams
– Mostly 64- and 128-byte requests

• Often streaming access patterns
– Each TLB entry is not reused many times
– Many demand misses


Page 33: Supporting x86-64 Address Translation for 100s of GPU Lanes


Design 3


[Figure: Design 3, Design 2 plus a page walk cache in the shared page walk unit]

Page 34: Supporting x86-64 Address Translation for 100s of GPU Lanes


Performance


Worst case: 12% slowdown

Average: Less than 2% slowdown

Page 35: Supporting x86-64 Address Translation for 100s of GPU Lanes


Correctness Issues

• Page faults
– Stall GPU as if TLB miss
– Raise interrupt on CPU core
– Allow OS to handle in normal way
– One change to CPU microcode, no OS changes

• TLB flush and shootdown
– Leverage CPU core, again
– Flush GPU TLBs whenever CPU core TLB is flushed


Page 36: Supporting x86-64 Address Translation for 100s of GPU Lanes


Page faults

Workload     Number of page faults   Average page fault latency
Lud          1                       1698 cycles
Pathfinder   15                      5449 cycles
Cell         31                      5228 cycles
Hotspot      256                     5277 cycles
Backprop     256                     5354 cycles

• Minor page faults handled correctly by Linux
• 5000 cycles = 3.5 µs: too fast for a GPU context switch


Page 37: Supporting x86-64 Address Translation for 100s of GPU Lanes


Summary

• D3: proof of concept
• Non-exotic design


Design      Per-CU L1 TLB entries   Highly-threaded page table walker   Page walk cache size   Shared L2 TLB entries
Ideal MMU   Infinite                Infinite                            Infinite               None
Design 0    N/A (per-lane MMUs)     None                                None                   None
Design 1    128                     Per-CU walkers                      None                   None
Design 2    128                     Yes (32-way)                        None                   None
Design 3    64                      Yes (32-way)                        8 KB                   None

• Low overhead
• Fully compatible
• Correct execution

Page 38: Supporting x86-64 Address Translation for 100s of GPU Lanes


Outline

• Motivation

• Data-driven GPU MMU design

• Alternative designs
– Shared L2 TLB
– TLB prefetcher
– Large pages

• Conclusions


Page 39: Supporting x86-64 Address Translation for 100s of GPU Lanes


Alternative Designs

• Shared L2 TLB
– Shared between all CUs
– Captures address translation sharing
– Poor performance: increased TLB-miss penalty

• Shared L2 TLB and PWC
– No performance difference
– Trade-offs in the design space


Page 40: Supporting x86-64 Address Translation for 100s of GPU Lanes


Alternative Designs

• TLB prefetching
– A simple 1-ahead prefetcher does not affect performance

• Large pages
– Effective at reducing TLB misses
– Requiring them places a burden on the programmer (see the sketch below)
– Must maintain compatibility
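As a sketch of the programmer burden that a large-page requirement implies: on Linux, getting 2 MB pages typically means allocating them explicitly, for example with mmap and MAP_HUGETLB as below, which only succeeds if huge pages were reserved beforehand. This example is illustrative and not part of the talk.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define LEN (8UL << 20)   // 8 MB, a multiple of the 2 MB huge-page size

int main(void) {
    // Explicit huge-page allocation; fails unless huge pages were reserved
    // (e.g. via /proc/sys/vm/nr_hugepages), illustrating the extra burden.
    void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    // ... fill buf and hand it to the GPU as usual ...
    munmap(buf, LEN);
    return 0;
}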


Page 41: Supporting x86-64 Address Translation for 100s of GPU Lanes


Alternative Design Summary


• Each design has 16 KB extra SRAM for all 16 CUs
• PWC needed for some workloads

Design            Per-CU L1 TLB entries   Page walk cache size   Shared L2 TLB entries   Performance
Design 3          64                      8 KB                   None                    0%
Shared L2         64                      None                   1024                    50%
Shared L2 & PWC   32                      8 KB                   512                     0%

Page 42: Supporting x86-64 Address Translation for 100s of GPU Lanes


Conclusions

• Shared virtual memory is important

• Non-exotic MMU design
– Post-coalescer L1 TLBs
– Highly-threaded page table walker
– Page walk cache

• Full compatibility with minimal overhead


Page 43: Supporting x86-64 Address Translation for 100s of GPU Lanes


Questions?


Page 44: Supporting x86-64 Address Translation for 100s of GPU Lanes


Related Work

• Architectural Support for Address Translation on GPUs: Pichai, Hsu, Bhattacharjee
– Concurrent with our work
– High level: modest changes for high performance
– Pichai et al. investigate warp scheduling
– We use a highly-threaded PTW


Page 45: Supporting x86-64 Address Translation for 100s of GPU Lanes


Energy and Area of Unified MMU

These configurations provide a large performance improvement over D2 and also reduce area and (possibly) energy


*data produced using McPAT and CACTI 6.5

Page 46: Supporting x86-64 Address Translation for 100s of GPU Lanes


Highly-threaded PTW


[Figure: the highly-threaded page table walker, per-CU TLBs feed a set of page walk buffers, each holding an outstanding address and its state, managed by a page walk state machine that sends requests to and receives replies from memory]
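A rough software analogue (assumptions only, not the hardware design) of one page-walk-buffer entry: each in-flight walk records the virtual address being translated, the current level and table address, which CU to reply to, and a state that advances as memory replies return. The 32-entry sizing mirrors the 32-way walker in the configuration table; the field and state names are invented for the sketch.

#include <stdint.h>

// States one walk moves through as the page walk state machine services it.
typedef enum { PW_IDLE, PW_ISSUED, PW_WAIT_MEM, PW_DONE, PW_FAULT } PWState;

typedef struct {
    uint64_t vaddr;         // outstanding virtual address being translated
    uint64_t table_paddr;   // physical address of the current-level table
    int      level;         // 3 = PML4 down to 0 = PT
    int      requester_cu;  // which CU's TLB receives the translation
    PWState  state;
} PageWalkBuffer;

static PageWalkBuffer walk_buffers[32];   // 32 concurrent walks

// On a memory reply, an entry either steps down one level and issues the
// next read, completes (fills the per-CU TLB), or raises a page fault.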

Page 47: Supporting x86-64 Address Translation for 100s of GPU Lanes


Page-fault Overview


[Figure: page-fault overview, the CPU core (CR3, CR2) and the GPU MMU (CR3, page-fault register) both reference the shared page table]