

Key Ideas

• Goal:
  – Increase memory bandwidth 100X using in-memory processors
• First smart-memory PIM device that is
  – Capable of executing independent threads of control
  – Designed to support in-memory virtual addressing
• Target applications
  – Image processing and multimedia (streaming)
  – Irregular memory accesses (sparse matrices, graphs and pointers)
• Evolutionary application migration path
  – PIMs also perform standard memory accesses
  – System supports familiar parallel programming paradigms

System Architecture

[Figure: System Architecture. A host processor (Itanium-2) sits on the processor memory bus together with the PIM chips; the PIMs are also linked by a PIM interconnect.]

PIM Node Architecture

[Figure: PIM Node Architecture. Blocks shown: Instruction Pipeline, ICache, Scalar Datapath, Scalar Register File, WideWord Datapath, WideWord Register File (256b), Memory Control & Arbiter, DRAM Memory, Node Main Data Bus, Parcel Buffer ("PBUF") with Header Registers and Data Registers, Host "DRAM" Interface. Signal labels: Instr., CTL, Datapath Control, Inter-datapath Register Data, Node Memory Requests, Host Memory Requests, ICache Mem Requests. Width labels: 32b, 256b.]

PIMs in Action, Delivering Memory Bandwidth

Tim Barrett, Spundun Bhatt, Jacqueline Chame, Jeff Draper, Mary Hall

USC Viterbi School of Engineering

System Prototype

[Photos: BGA Top and Bottom Views; PIM DIMMs in IA64 Host]

• PIM Chip (2nd Gen.)
  – TSMC 0.18 µm technology
  – Floorplan labels: SRAM blocks; Node Processing Logic, Pbuf, DDR SDRAM Interface, PiRC
• System includes four boards with eight PIM chips

PIM Node Compiler Technology

[Figure: compiler flow. Labels: C/Fortran program; MIT-SLP; Pre-existing ISI Extensions; Superword instruction extended C program; Superword-Extended GCC; Macintosh G4 executable; DIVA executable.]

• Optimizations named in the figure
  – Superword Locality Optimizations
  – Compiler-Controlled Caching
  – Page Mode Memory Accesses
  – SLP in Presence of Control Flow

Overview of Implementation

* Insertion of custom PIMs in the memory space of a commodity platform
  - Itanium-2 HP zx6000
  - DDRAM interface
* System software
  - Linux 2.4 & 2.6 device driver for PIM
  - compiler technology
* Challenges related to reliability and bandwidth features of commodity memory systems
  - ECC
  - ChipSpare (ChipKill)
  - address line interleaving

Measurements

Comparison: Itanium-2 vs. PIM

            Clock     Transistors           Power   CPU Info                            Area
Itanium-2   900 MHz   221M                  ~100W   EPIC, 6-way                         421 mm²
1 PIM       140 MHz   56.6M (55M memory)    ~1W     single-issue, in-order, pipelined   121 mm²
8 PIMs      140 MHz   453M (440M memory)    ~8W     single-issue, in-order, pipelined   8 x 121 mm²

[Figure: StreamAdd Performance. Execution time in seconds vs. data set size in number of elements.]

[Figure: StreamAdd Data Layout Sensitivity. Execution time in seconds vs. offset between arrays in number of elements.]

This material is based on research sponsored by AFRL and NSA under agreement number FA8750-04-1-0265. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL and NSA or the U.S. Government.

Current Work

Objective
• Evaluate link discovery (LD) algorithms on Godiva H/W.

Hypothesis
• LD algorithms are data-intensive and highly parallel
• Largely read-only data
• Irregular memory accesses; poor cache performance
• PIM technology would yield performance improvement

Expected Results
• Parallel PIM implementations of LD computations
• Performance comparisons with Itanium-2 host
• Analysis of software/hardware scalability requirements
• Analysis of programming complexity

KR&R
• PowerLoom logic-based KR&R system
• Representation
  – Relational knowledge
  – Rich link vocabulary (Natural Language extraction)
  – Complex and dynamic domain knowledge
  – Meta-knowledge ("interestingness", etc.)
  – Multiple hypotheses
  – Threat/Non-Threat Patterns

Target LD Algorithms

Properties
• Sparse graph algorithms
• Read-only data
• Some temporal/spatial locality after partitioning

Examples
• Mutual information
• Graph clustering
• Model-based and mixture model methods such as HMMs

LD Challenges & Solution Approach

• Data
  – Incomplete data
  – Noise, corruption, uncertainty
  – Unaligned entities or groups
• Scale
  – Very large data volume
  – Connectivity curse
• Solution approach
  – RDBMS, parallelism, focused search
  – KR&R + partial logical inference
  – Statistical methods: rarity analysis, mutual information, HHMM

PIMs for Knowledge Discovery: in collaboration with Hans Chalupsky & Jafar Adibi, USC ISI

Random Access: Host Alone & Host w/ PIMs

Original Code

    // Host-only RandomAccess
    uInt64 Table[TableSize];   // TableSize must be a power of two
    uInt64 ran;

    // initialize main table
    for (i=0; i<TableSize; i++)
      Table[i] = i;

    // perform updates
    ran = 1;
    for (i=0; i<NUpdates; i++) {
      // ran is unsigned, so test its top bit rather than ran < 0
      ran = (ran << 1) ^ ((ran >> 63) ? POLY : 0);
      Table[ran & (TableSize-1)] ^= ran;
    }

Host and PIM Code

    // Host code for Host-and-PIM RandomAccess
    // initialize main table
    for (i=0; i<TableSize; i++)
      Table[i] = i;

    ran = 1;
    for (i=0; i<NUpdates; i++) {
      // ran is unsigned, so test its top bit rather than ran < 0
      ran = (ran << 1) ^ ((ran >> 63) ? POLY : 0);
      offset = ran & (TableSize-1);
      parcel.command = UPDATE;
      parcel.payload[0] = ran;
      parcel.payload[1] = offset;
      SendParcel(&Table[offset], parcel);
    }
    // tell every PIM that no more updates will arrive
    parcel.command = DONE;
    for (pim=0; pim<NumPims; pim++)
      SendParcel(&done[pim], parcel);

    // PIM code for Host-and-PIM RandomAccess
    done = FALSE;
    while (!done) {
      // check for parcels from host processor (non blocking)
      RecvParcel(hostParcelBuffer, recvStatus, parcel);
      if (recvStatus == TRUE) {
        if (parcel.command == UPDATE) {
          ran = parcel.payload[0];
          offset = parcel.payload[1];
          Table[offset] ^= ran;   // local memory access
        } else if (parcel.command == DONE) {
          done = TRUE;
        }
      }
    }

Host and PIM Algorithm Overview

HOST
• Generate next update pair <ran, offset>
• Bucket update for PIM according to address
• If bucket full, send parcel

[Figure: host buckets b0, b1, ..., bn-1 feed the parcel buffers of PIM 0, PIM 1, ..., PIM n-1.]

PIM
• Recv parcel from parcel buffer
• Perform updates (ran0, ran1, ran2, ran3) in memory