Trading Cache Hit Rate for Memory Performance

Wei Ding, Mahmut Kandemir, Diana Guttman, Adwait Jog, Chita R. Das, Praveen Yedlapalli

The Pennsylvania State University


Summary

• Proposal
– A compiler-runtime cooperative data layout optimization that improves row-buffer locality in irregular programs
– ~17% improvement in overall application performance

• Problem
– Most data locality optimizations exclusively target cache locality, but row-buffer locality is also important
– The problem is especially challenging in the case of irregular programs (sparse data)


Outline

• Background
• Motivation
• Conservative Layout
• Fine-grain Layout
• Related Work
• Evaluation
• Conclusion


DRAM Organization

[Figure: DRAM organization, showing the processor, memory controllers (MC), channel, DIMM, rank, DRAM chip, bank, and row buffer, and illustrating row-buffer locality.]


Irregular Programs

Real X(num_nodes), Y(num_edges);
Integer IA(num_edges, 2);
for (t = 1; t < T; t++) {
  /* If it is time to update the interaction list */
  for (i = 0; i < num_edges; i++) {
    X(IA(i, 1)) = X(IA(i, 1)) + Y(i);
    X(IA(i, 2)) = X(IA(i, 2)) - Y(i);
  }
}


Inspector/Executor model

/* Executor */
Real X(num_nodes), Y(num_edges);
Real X'(num_nodes), Y'(num_edges);
Integer IA(num_edges, 2);
for (t = 1; t < T; t++) {
  X', Y' = Trans(X, Y);
  for (i = 0; i < num_edges; i++) {
    X'(IA(i, 1)) = X'(IA(i, 1)) + Y'(i);
    X'(IA(i, 2)) = X'(IA(i, 2)) - Y'(i);
  }
}

/* Inspector */
Trans(X, Y):
  for (i = 0; i < num_edges; i++) {
    /* data reordering algorithms */
  }
  return (X', Y')

Used for identifying parallelism or improving cache locality
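A minimal Python sketch of the inspector/executor split follows. It assumes a first-touch renumbering as the reordering step, which the slide leaves unspecified, and it reorders only X. The names `inspector` and `executor` and the 0-based indexing are illustrative choices, not the authors' code.

```python
def inspector(ia, num_nodes):
    """Return perm with perm[old] = new, numbering nodes in first-touch order.

    First-touch ordering is an assumed stand-in for the unspecified
    "data reordering algorithms" on the slide.
    """
    perm = [-1] * num_nodes
    next_slot = 0
    for a, b in ia:
        for node in (a, b):
            if perm[node] == -1:
                perm[node] = next_slot
                next_slot += 1
    # Nodes that no edge touches are appended at the end.
    for node in range(num_nodes):
        if perm[node] == -1:
            perm[node] = next_slot
            next_slot += 1
    return perm


def executor(x, y, ia, perm):
    """Run the original update loop on the reordered copy of x."""
    x_new = [0.0] * len(x)
    for old, new in enumerate(perm):
        x_new[new] = x[old]
    # Remap the index array once so the loop body is unchanged.
    ia_new = [(perm[a], perm[b]) for a, b in ia]
    for i, (a, b) in enumerate(ia_new):
        x_new[a] += y[i]
        x_new[b] -= y[i]
    # Map the result back to the original numbering for the caller.
    return [x_new[perm[old]] for old in range(len(x))]


ia = [(3, 1), (1, 0)]
perm = inspector(ia, 4)
result = executor([0.0, 0.0, 0.0, 0.0], [1.0, 2.0], ia, perm)
```

The result matches what the untransformed loop computes; only the order in which X is laid out and touched changes.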



Row-buffer Locality

• Prior works that target irregular applications focus exclusively on improving cache locality
– No efforts to improve row-buffer locality

• Typical latencies (based on AMD architecture)
– Last Level Cache (LLC) hit = 28 cycles
– Row-buffer hit = 90 cycles
– Row-buffer miss = 350 cycles

• Application performance is dictated not only by the cache hit rate, but also by the row-buffer hit rate.
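To see why both rates matter, the latencies above can be combined into an expected access cost. The sketch below assumes a simple model in which every LLC miss is served by DRAM at either the row-buffer hit or the row-buffer miss latency. The miss rates passed in are made-up illustrations, not measurements from the talk.

```python
LLC_HIT = 28    # cycles, from the slide
RB_HIT = 90     # cycles, from the slide
RB_MISS = 350   # cycles, from the slide


def avg_latency(llc_miss_rate, rb_miss_rate):
    """Expected cycles per access that reaches the LLC (assumed model)."""
    dram = (1 - rb_miss_rate) * RB_HIT + rb_miss_rate * RB_MISS
    return (1 - llc_miss_rate) * LLC_HIT + llc_miss_rate * dram


# Hypothetical layouts: the second misses more often in the LLC
# but hits the row buffer more often.
cache_friendly = avg_latency(0.20, 0.40)
row_friendly = avg_latency(0.22, 0.20)
```

With these illustrative inputs, raising the LLC miss rate from 20% to 22% while halving the row-buffer miss rate from 40% to 20% lowers the expected cost from about 61.2 to about 53.1 cycles.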


Example

Layout (b) eliminates the row-buffer miss caused by accessing ‘y’, assuming this move does not cause any additional cache misses.

Layout (c) eliminates the row-buffer misses caused by accessing ‘v’, even at the cost of an additional cache miss.



Notations

• Seq: the sequence of data elements obtained by traversing the index array

• αx: the access to a particular data element x in Seq

• time(αx): the “logical time stamp” of x in Seq

• βx: the memory block where data element x resides

• α'x: the “most recent access” to βx before αx

• Caches(βx): the set of cache blocks to which βx can be mapped in a k-way set-associative cache


Definition

• Block Distance: Given Caches(βx) = Caches(βy), the block distance between αx and αy, denoted as Δ(αy, αx), is the number of distinct memory blocks that are mapped to Caches(βx) and accessed between time(αx) and time(αy)
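The definition can be made concrete with a small sketch. It assumes that Seq is a Python list, that logical time stamps are list indices, that a block maps to cache set `block % num_sets`, and that both endpoint accesses are counted. All of these are illustrative choices that the slide does not fix.

```python
def block_distance(seq, t_x, t_y, block_of, num_sets):
    """Count distinct blocks in the cache set of seq[t_x] that are accessed
    between time stamps t_x and t_y (endpoints included, by assumption)."""

    def cache_set(elem):
        # Assumed mapping: consecutive blocks go to consecutive sets.
        return block_of[elem] % num_sets

    if cache_set(seq[t_x]) != cache_set(seq[t_y]):
        raise ValueError("both accesses must map to the same cache set")
    lo, hi = min(t_x, t_y), max(t_x, t_y)
    target = cache_set(seq[t_x])
    touched = {block_of[e] for e in seq[lo:hi + 1] if cache_set(e) == target}
    return len(touched)


seq = ["a", "b", "c", "d", "a"]
block_of = {"a": 0, "b": 2, "c": 1, "d": 4}
distance = block_distance(seq, 0, 4, block_of, num_sets=2)
```

Here blocks 0, 2, and 4 share cache set 0 and block 1 does not, so the distance between the two accesses to `a` is 3.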


Lemma

• Locality Set: A set of data elements, denoted by Ω, forms a locality set if and only if:
– ∀x, y ∈ Ω: βx = βy (All elements of Ω reside in the same memory block)
– ∀x, y ∈ Ω, ∀αx, αy: Δ(αy, αx) ≤ k (The block distance between any pair of elements in the set ≤ k)
– ∀x ∉ Ω, ∀y ∈ Ω, ∀αx, αy: Δ(αy, αx) > k (The block distance between an element and a non-element > k)

• Non-increased Cache Misses: Moving Ω from βx to βy will not increase the total number of cache misses in Seq if Caches(βy) = Caches(βx)


Conservative Layout

• Objective:
– Increase row-buffer hit rate
– Without affecting the cache performance

• Algorithm
1. Identifying the locality sets
2. Constructing the interference graph
3. Assigning rows in memory


1. Identifying the Locality Sets

• Traversing the index array, for each cache set, we maintain a list of most frequent accesses to ‘k’ different memory blocks
– The block distance between the current access and any other access on the list is never greater than k

• During this traversal, x and y are placed into the same locality set only when βx = βy


2. Constructing the Interference Graph

• Each node represents a locality set
• If αx and αy are two accesses that incur successive cache misses in Seq, and x and y are located in different rows, then an edge is added between the locality sets of x and y
– The weight on this edge represents the total number of such αx and αy pairs
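The construction above can be sketched in a few lines. The inputs are assumptions: `miss_seq` is a hypothetical list of the data elements whose accesses missed in the cache, in program order (for example from a cache simulation), and `row_of` and `locality_set_of` are lookup tables built by earlier steps.

```python
from collections import Counter


def interference_graph(miss_seq, row_of, locality_set_of):
    """Weight each pair of locality sets by how often their elements incur
    back-to-back cache misses that land in different DRAM rows."""
    weights = Counter()
    for x, y in zip(miss_seq, miss_seq[1:]):
        if row_of[x] != row_of[y]:
            # Undirected edge, so (x, y) and (y, x) share one key.
            edge = frozenset((locality_set_of[x], locality_set_of[y]))
            weights[edge] += 1
    return weights


graph = interference_graph(
    ["x", "y", "x", "z"],
    row_of={"x": 0, "y": 1, "z": 0},
    locality_set_of={"x": "A", "y": "B", "z": "C"},
)
```

Passing an identity mapping as `locality_set_of` gives a graph whose nodes are individual data elements instead of locality sets.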


3. Assigning Rows in Memory

• Sort the edges in the interference graph in decreasing order of weight

• Assign the same row to the locality sets connected by the edge with the largest weight
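These two steps amount to a greedy pass over the edges. Here is a hedged sketch: the row capacity `sets_per_row` (how many locality sets fit in one DRAM row) and the handling of leftover sets are assumptions, since the slide does not say what happens when a row fills up.

```python
def assign_rows(weights, locality_sets, sets_per_row):
    """Greedily co-locate locality sets joined by the heaviest edges."""
    row_of = {}   # locality set -> row index
    fill = []     # fill[r] = number of locality sets already placed in row r

    def place(s, r):
        row_of[s] = r
        fill[r] += 1

    def new_row():
        fill.append(0)
        return len(fill) - 1

    # Heaviest edge first; ties are broken by name to keep the result stable.
    edges = sorted(weights.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for edge, _weight in edges:
        a, b = sorted(edge)
        if a not in row_of and b not in row_of:
            if sets_per_row >= 2:
                r = new_row()
                place(a, r)
                place(b, r)
        elif a in row_of and b in row_of:
            continue  # both already placed; this sketch never moves a set
        else:
            placed, free = (a, b) if a in row_of else (b, a)
            r = row_of[placed]
            if fill[r] < sets_per_row:
                place(free, r)
    # Assumed fallback: leftover sets go to the first row with space.
    for s in locality_sets:
        if s not in row_of:
            r = next((i for i, n in enumerate(fill) if n < sets_per_row), None)
            place(s, new_row() if r is None else r)
    return row_of


rows = assign_rows(
    {("A", "B"): 5, ("B", "C"): 3, ("C", "D"): 1},
    ["A", "B", "C", "D"],
    sets_per_row=2,
)
```

In this run the heaviest edge puts A and B in row 0; C cannot join them because the row is full, so C and D end up together in row 1.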



Fine-grain Layout

• Partition: Given x ∈ Ω, a partition for x is defined as a subset of Ω, denoted as Px, where x ∈ Px

• Basic Idea: Whenever the accesses to two data elements (denoted as x and y) incur successive cache misses and x and y reside in different rows in memory
– Try to find two partitions for x and y, Px and Py, such that, when placing Px and Py into the same row, the increased cache miss latency is less than the reduced row-buffer miss latency


Algorithm

1. Constructing the Interference Graph
2. Constructing the Locality Graphs
3. Finding Partitions
4. Assigning Rows in Memory


1. Constructing the Interference Graph

• Each node in the interference graph represents a data element

• If αx and αy are two accesses that incur successive cache misses in Seq, and x and y are located in different rows, then we set up an edge between x and y
– The weight on the edge represents the number of such αx and αy pairs


2. Constructing the Locality Graphs

• Each locality set Ω has a locality graph, where each node is a data element in Ω

• For any access αu whose block reuse distance is exactly k, if there exist αx and αy within the time slot [time(α'u), time(αu)], such that x, y and u belong to the same locality set, then we increase the weight of the edge between x and y by 1
– If we move all the elements in a partition for x to another memory block βy, such that Caches(βy) = Caches(βx), then the number of increased cache misses is at most equal to the sum of the weights of the edges connected to x in the locality graph


3. Finding Partitions

• Sort the edges in the interference graph in decreasing order of weight

• We first consider isolating x and y from their locality sets, i.e., placing only x into Px , and only y into Py

• We add data elements connected to x into Px and elements connected to y to Py until

(N - Nrb) x rb > Nch x
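The stopping condition can be illustrated with a sketch that follows the basic idea stated for the fine-grain layout: the row-buffer miss latency saved must exceed the cache miss latency added. Everything concrete here is an assumption: the function name, the per-miss penalties, the use of the total weight of locality-graph edges leaving the partition as the estimate of added cache misses, and the choice to grow along the heaviest such edge.

```python
def grow_partition(x, locality_graph, rb_misses_saved, rb_penalty, cache_penalty):
    """Grow Px from {x} until the move looks profitable or cannot grow.

    locality_graph maps (u, v) element pairs to edge weights.
    Returns (partition, profitable).
    """
    px = {x}
    while True:
        # Edges with exactly one endpoint inside the partition.
        cut = [(w, u, v) for (u, v), w in locality_graph.items()
               if (u in px) != (v in px)]
        extra_cache_misses = sum(w for w, _, _ in cut)
        profitable = rb_misses_saved * rb_penalty > extra_cache_misses * cache_penalty
        if profitable or not cut:
            return px, profitable
        # Pull in the neighbor across the heaviest crossing edge.
        _, u, v = max(cut)
        px.add(v if u in px else u)


partition, ok = grow_partition(
    "x",
    {("x", "a"): 3, ("a", "b"): 1},
    rb_misses_saved=1,
    rb_penalty=350,     # illustrative cycles per avoided row-buffer miss
    cache_penalty=200,  # illustrative cycles per added cache miss
)
```

In this run, isolating `x` alone would add up to 3 cache misses (600 cycles against 350 saved), so `a` is pulled in; the partition {x, a} then adds at most 1 cache miss and the move becomes profitable.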


4. Assigning Rows in Memory

• Each partition is assigned to a memory block in a row


Example


Related Work

• Inspector/Executor model
– Typically used for parallelism (Lawrence Rauchwerger [1]) and cache locality (Chen Ding [2])
– We use it to improve row-buffer locality and our approach is complementary to them

• Row buffer locality
– Compiler approach: Mary W. Hall [3]
– Hardware approach: Al Davis [4]
– Our work specifically targets irregular applications



Evaluation

Platform (modeled in GEM5)

CPU      12 cores; 2.6 GHz; 4 memory controllers
Caches   64KB per core L1 (3 cycles); 512KB per core L2 (12 cycles); 12MB per socket shared L3 (28 cycles)
Memory   DDR3-1866; 8 banks per channel; 8KB row-buffers

Benchmarks

Name     Input Size   L3 Miss Rate   RB Miss Rate
PSST     427.6 MB     18.1 %         29.6 %
PaSTiX   511.6 MB     24.3 %         41.7 %
SSIF     129.3 MB     13.7 %         24.4 %
PPS      738.2 MB     21.4 %         33.1 %
REACT    1.2 GB       28.6 %         46.9 %


Simulation Results

[Figure: L3 misses, row-buffer misses, and execution cycles, normalized with respect to the original layout, for the Cache-Optimized, Conservative, and Fine-Grain layouts on PSTT, PaSTiX, SSIF, PPS, and REACT. Callouts on the chart: 6 %, 15 %, 27 %, 12 %, 17 %.]


[Figure: Normalized execution cycles for the Cache-Optimized, Conservative, and Fine-Grain layouts, in four panels: row-buffer size (4KB, 8KB, 16KB, 32KB); number of memory controllers (2, 4, 8); number of cores (4, 8, 12, 16, 20); L3 capacity (8MB, 10MB, 12MB, 14MB, 16MB).]


Conclusion

• Exploiting row-buffer locality is critical for application performance

• We proposed two compiler-directed data layout organizations with the goal of improving row-buffer locality in irregular applications
– Without affecting cache performance
– Trading cache performance for row-buffer locality


Thank You

• Questions?


References

1. “Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Run Time”, ICPP 2012

2. “Sensitivity Analysis for Automatic Parallelization on Multi-Cores”, ICS 2007

3. “A Compiler Algorithm for Exploiting Page-Mode Memory Access in Embedded DRAM Devices”, MSP 2002

4. “Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement”, ASPLOS 2010


BACKUP SLIDES


Results with AMD-based system

[Figure: L3 misses, row-buffer misses, and execution cycles, normalized with respect to the original layout, for the Cache-Optimized, Conservative, and Fine-Grain layouts on PSTT, PaSTiX, SSIF, PPS, and REACT.]


Memory Scheduling

[Figure: Normalized execution cycles on PSTT, PaSTiX, SSIF, PPS, and REACT under the FCFS, FR-FCFS, ATLAS, and TCM memory schedulers, each shown alone and combined with the Conservative and Fine-Grain layouts.]