A Programmable Memory Hierarchy for Prefetching Linked Data Structures


A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Chia-Lin Yang

Department of Computer Science and Information Engineering

National Taiwan University

Alvin R. Lebeck

Department of Computer Science

Duke University

2

Memory Wall

• Processor-memory gap grows over time

• Prefetching
  – What? Future Address Prediction
  – When? Prefetch Schedule

[Chart: Processor-Memory Gap, 1980–2000 — CPU performance improves ~60%/yr while DRAM performance improves ~10%/yr]

3

Prefetch Linked Data Structures (LDS)

• Linked data structures
  – No regularity in the address stream
    • Adjacent elements are not necessarily contiguous in memory
  – Pointer-chasing problem

p = head;
while (p) {
    work(p->data);
    p = p->next;
}

while (p) {
    prefetch(p->next->next->next);
    work(p->data);
    p = p->next;
}

[Diagram: the node currently visited (p) and the node three links ahead we would like to prefetch]

4

The Push Architecture

• A LDS prefetching framework built on a novel data movement model - Push (Yang’2000)

[Diagram: in the traditional pull model, each prefetch request travels down from L1 through L2 to main memory and the data is pulled back up; in the new push model, prefetch engines at the lower levels push data up toward L1]

5

Outline

• Background & Motivation
• What is the Push Architecture?
• Design of the Push Architecture
• Variations of the Push Architecture
• Experimental Results
• Related Research
• Conclusion

6

Block Diagram of the Push Architecture

[Diagram: CPU with a prefetch buffer beside L1; a prefetch engine is attached to L1, L2, and main memory, issuing prefetch requests across the L2 bus and memory bus]

7

How to Predict Future Addresses?

• LDS traversal kernels

• Load instructions in LDS traversal kernels are a compact representation of LDS accesses [Roth’98]

• PFEs execute LDS traversal kernels independent of the CPU

• The amount of computation between node accesses affects how far the PFE could run ahead of the CPU

while (list != NULL) {
    p = list->x;
    process(p->data);
    list = list->next;   /* recurrent load */
}

8

The Pointer-Chasing Problem: How Does the Push Model Help?

• Push model: pipelined process

[Diagram: as the memory-level PFE computes each node address (a1, a2, ...), the corresponding blocks (x1, x2, ...) are pushed up through L2 and L1 while the next recurrent load (r1, r2, ...) proceeds, overlapping block transfers with the serialized pointer dereferences]

9

Push Architecture Design Issues

[Diagram: CPU, L1, L2, and main memory, with a PFE and controller at each level]

1. PFE Architecture Design
2. Interaction Scheme
3. Synchronization between the CPU and PFE execution
4. Redundant Prefetch
5. Demands on the cache/memory controller

10

ISSUE #1: PFE Architecture

• Programmable PFE
  – General purpose processor core
  – 5-stage pipeline, in-order processor
  – Integer ALU units for address calculation & control flow
  – TLB for address translation
  – Root register to store the root address of the LDS being traversed

11

Issue #2: Interaction among PFEs

[Diagram: CPU plus L1, L2, and memory PFEs, each with a root register. (1) The CPU stores the root address; (2) the L1 PFE issues node x; on a miss for x the L1 PFE stops, and the lower-level PFEs resume the traversal from x]

Tree(root);
  :

Tree(node) {
    if (node) {
        Tree(node->left);
        Tree(node->right);
    }
}

12

Issue #3: Synchronization between CPU and PFEs

• When do we need to synchronize the CPU and PFE execution?
  – Early prefetches
    • the PFEs are running too far ahead of the CPU
  – Useless prefetches
    • the PFEs are traversing down the wrong path
    • the PFEs are running behind the CPU
• Throttle mechanism

[Diagram: prefetch buffer with a free bit per cache block; the PFE produces entries and the CPU consumes them]

13

Variations of the Push Architecture

• 2_PFE should perform comparably to 3_PFE

[Diagram: 3_PFE attaches a push-mode PFE to each of L1, L2, and main memory; 2_PFE keeps two PFEs, pulling into the upper level and pushing from below; 1_PFE keeps only a single memory-level PFE pushing data up]

• 1_PFE performs well if most of LDS exist only in the main memory

14

Outline

• Background & Motivation
• What is the Push Architecture?
• Design of the Push Architecture
• Variations of the Push Architecture
• Experimental Results
• Related Research
• Conclusion

15

Experimental Setup

• SimpleScalar: out-of-order processor
• Benchmarks: Olden benchmark suite & rayshade
• Baseline processor:
  – 4-way issue, 64 RUU, 16 LSQ
  – lockup-free caches with 8 outstanding misses
  – 32KB, 32B line, 2-way L1 & 512KB, 64B line, 4-way L2
  – 84-cycle round-trip memory latency & 48-cycle DRAM access time
• Prefetch model
  – Push model: 3 levels of PFEs, 32-entry fully-associative prefetch buffer
  – Pull model: L1-level PFE, 32-entry fully-associative prefetch buffer

16

Performance Comparison: Push vs. Pull

• health, mst, perimeter and treeadd
  – Push: 4% to 25% speedup; Pull: 0% to 4% speedup
• em3d, rayshade
  – Push: 31% to 57% speedup; Pull: 25% to 39% speedup
• bh
  – Push: 33% speedup; Pull: 33% speedup
• Dynamically changing structures: bisort and tsp

[Chart: normalized execution time, split into memory latency and computation time, for health, em3d, mst, rayshade, perimeter, bh, bisort, treeadd, tsp, and voronoi]

17

Variations of the Push Architecture

[Chart: normalized execution time for Base, 3_PFE, 2_PFE, and 1_PFE]

• 2_PFE performs comparably to 3_PFE
• 1_PFE performs comparably to 3_PFE except for em3d

18

Related Work

• Prefetching for Irregular Applications:
  – Correlation-based prefetch (Joseph'97, Alexander'96)
  – Compiler-based prefetch (Luk'96)
  – Dependence-based prefetch (Roth'98)
  – Jump-pointer prefetch (Roth'99)
• Decoupled Architecture
  – Decoupled Access/Execute (Smith'82)
  – Pre-execution (Annavaram'2001, Collins'2001, Roth'2001, Zilles'2001, Luk'2001)
• Processor-in-Memory
  – Berkeley IRAM Group (Patterson'97)
  – Active Pages (Oskin'98)
  – FlexRAM (Kang'99)
  – Impulse (Carter'99)
  – Memory-side prefetching (Hughes'2000)

19

Conclusion

• Built a general architectural solution for the push model
• The push model is effective in reducing the impact of the pointer-chasing problem on prefetching performance
  – applications with tight traversal loops
    • Push: 4% to 25% speedup; Pull: 0% to 4%
  – applications with longer computation between node accesses
    • Push: 31% to 57% speedup; Pull: 25% to 39%
• 2_PFE performs comparably to 3_PFE.

20

Traversal Kernel

void *HashLookup(int key, Hash hash) {
    j = (hash->mapfunc)(key);
    for (ent = hash->array[j]; ent && ent->key != key; ent = ent->next)
        ;
    if (ent) return ent->entry;
    return NULL;
}

void kernel(HashEntry ent, int key) {
    for ( ; ent && ent->key != key; ent = ent->next)
        ;
}

Passed from the CPU to the PFE through a memory-mapped interface:
1. traversal kernel identifier
2. hash->array[j]
3. key

21

Block Diagram of Specialized PFE

[Diagram: specialized PFE with recurrent-load and non-recurrent-load tables, root and kernel-id registers, instruction buffer, traversal-info table, ready queue (pc, base, offset) feeding adders, result buffer (pc), TLB, and the cache/memory controller]

22

Block Diagram of Programmable PFE

[Diagram: programmable PFE — a processor core with instruction cache, stack, register file, and root register; memory-mapped kernel-id register, instruction buffer, kernel index table, and result buffer; TLB and cache/memory controller. Local and global accesses are distinguished.]

23

Issue #4: Redundant Prefetches

• Redundant prefetches in tree traversals:

[Diagram: a 7-node binary tree traversed across L1, L2, and main memory, illustrating how revisited nodes generate redundant prefetches]

24

Issue #4: Redundant Prefetches

• Performance impact
  – Waste bus bandwidth
  – Memory accesses are satisfied more slowly in the lower levels of the memory hierarchy
• Add a small data cache to the L2/Memory PFEs

[Diagram: the PFE processor checks its data cache first; hits return the result directly, and only misses are forwarded as requests to the cache/memory controller]

25

Issue #5: Modifications to the Cache/Memory Controller

[Diagram: request buffers added alongside the MSHRs at the L2 and memory controllers; demand requests merge in the MSHRs, while demand and prefetch requests merge in the PFE request buffers on the L2 bus and memory bus]

26

How to Avoid Early Prefetches?

[Diagram: depth-first traversal of a 15-node binary tree at times t1–t3; the PFE runs so far ahead (prefetching nodes 5, 3, 4 while the CPU is still near node 2) that blocks arrive before the CPU can use them]

27

How to Avoid Early Prefetches?

[Diagram: the PFE suspends execution when prefetch-buffer entries (2, 3, 4) remain unconsumed, and continues execution once the CPU consumes them and their free bits are updated (t1 vs. t3)]

28

How to Avoid Useless Prefetches?

[Diagram: at t1 the memory PFE has prefetched blocks 1–5 and suspends; the CPU's L1 hits and L1/L2 misses reveal which path it is actually following, and a demand miss on block 6 triggers PFE execution again]

29

How to Avoid Useless Prefetches?

[Diagram: continuation — at t2 the demand miss on block 6 re-triggers the suspended memory PFE, which resumes prefetching along the CPU's current path (block 7), cutting off useless prefetches down the wrong path]

30

Performance Prediction of the Push Architecture for Future Processors

[Charts: normalized execution time, split into memory latency and computation time, for health, em3d, mst, rayshade, perimeter, bh, treeadd, and tsp at 0.8, 1.2, 1.6, and 2.0 GHz processor clocks]

31

Prefetch Coverage

[Chart: prefetch coverage — % of cache misses totally hidden vs. partially hidden, per benchmark]

32

Prefetch Distribution

[Chart: distribution of prefetches issued from the L1, L2, and memory PFEs]

33

Cumulative Distance between Recurrent Loads

[Chart: cumulative distribution of the distance between recurrent loads: <8, <16, <32, <64, <128, >128]

34

Bandwidth Requirement

[Chart: normalized execution time with limited vs. non-limited bandwidth]

35

Effect of the PFE Data Cache & Throttle Mechanism

[Chart: normalized execution time for base, push_base, push_buffer, push_throttle, and push_buffer_throttle across the benchmarks]

• The throttle mechanism has an impact on bh.
• The PFE data cache has an impact on em3d, perimeter, and treeadd.

36

Effect of the PFE Data Cache

• em3d, perimeter, bh, and treeadd: 30% to 50% of prefetches are redundant
• 70% to 100% of redundant prefetches are captured in the PFE data cache

[Charts: redundant prefetch distribution per benchmark, and % of redundant prefetches captured in the PFE data cache at the L2 and memory levels]

37

PFE Architecture: Effect of Wider-Issue PFEs

• Increasing issue width further improves performance, particularly for em3d and treeadd

[Chart: normalized execution time for base, single-issue, 2-issue, and 4-issue PFEs]

38

TLB Miss Effect

• Hardware TLB miss handler, 30 cycle TLB miss penalty

[Chart: normalized execution time with a perfect TLB vs. 32-, 64-, 128-, and 256-entry PFE TLBs, per benchmark]

39

PFE Architecture: Specialized vs. Programmable PFE

• A programmable PFE can achieve performance comparable to a specialized PFE

[Chart: normalized execution time of the specialized vs. programmable PFE for Health, Mst, and Rayshade]

40

Breadth-First Tree Traversal

[Diagram: a 15-node binary tree traversed breadth-first using an auxiliary list of node pointers (head/tail); nodes 8–15 are queued as their parents are visited]

Traversal Kernel:
list = head;
while (list) {
    node = list->ptr;
    left = node->left;
    right = node->right;
    list = list->next;
}

41

Push Architecture Design Issues

[Diagram: CPU, L1, L2, and main memory, with a PFE and controller at each level]

1. PFE Architecture Design
2. Interaction Scheme
3. Synchronization between the CPU and PFE execution
4. Redundant Prefetch
5. Demands on the cache/memory controller

42

Restore PFE State

00400950  addiu $sp[29],$sp[29],-56
00400958  sw    $ra[31],48($sp[29])
00400960  sw    $s8[30],44($sp[29])
00400968  sw    $s0[16],40($sp[29])
00400970  addu  $s8[30],$zero[0],$sp[29]
00400978  addu  $s0[16],$zero[0],$a0[4]
00400980  beq   $s0[16],$zero[0],004009a8
00400988  lw    $a0[4],4($s0[16])        (x) miss
00400990  jal   00400950 <K_TreeAdd>
00400998  lw    $a0[4],8($s0[16])        (y)
004009a0  jal   00400950 <K_TreeAdd>
004009a8  addu  $sp[29],$zero[0],$s8[30]
004009b0  lw    $ra[31],48($sp[29])
004009b8  lw    $s8[30],44($sp[29])
004009c0  lw    $s0[16],40($sp[29])
  :

[Diagram: a 7-node binary tree; x and y mark the left and right children being prefetched]

• x issued: 400988
• x miss: 400990, 400950–400978 execute before the miss returns
• y issued: 400998
• Save registers on the stack; restore registers from the stack to recover the register file and PC

43

Restore PFE State

• Correct resume PC
  – Statically construct the resume-PC table

Recurrent Load PC    Resume PC
400988               400998
400998               4009a8
