
Page 1: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

PIs: Alvin M. Despain and Jean-Luc Gaudiot


University of Southern California
http://www-pdpc.usc.edu

October 12th, 2000

Page 2: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero overhead prefetches for maximal computation throughput

Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes

Schedule (start: April 1998)
April 1998 - April 1999:
• Defined benchmarks
• Completed simulator
• Performed instruction-level simulations on hand-compiled benchmarks
April 1999 - April 2000:
• Continue simulations of more benchmarks (SAR)
• Define HiDISC architecture
• Benchmark results
April 2000 onward:
• Develop and test a full decoupling compiler
• Update simulator
• Generate performance statistics and evaluate design

[Quad-chart system diagram: Sensor Inputs feed an Application (FLIR, SAR, VIDEO, ATR/SLD, Scientific), which the Decoupling Compiler maps onto the HiDISC Processor (a processor per memory-hierarchy level: registers, cache, memory), supporting a Dynamic Database and Situational Awareness.]

Page 3: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

HiDISC: Hierarchical Decoupled Instruction Set Computer

[Same system diagram as the previous slide: Sensor Inputs -> Application (FLIR, SAR, VIDEO, ATR/SLD, Scientific) -> Decoupling Compiler -> HiDISC Processor -> Dynamic Database -> Situational Awareness.]

Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)

Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]

Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs

Present Solutions: Multithreading (Homogeneous), Larger Caches, Prefetching, Software Multithreading

Page 4: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Present Solutions

Larger Caches
— Slow
— Works well only if the working set fits in the cache and there is temporal locality

Hardware Prefetching
— Cannot be tailored for each application
— Behavior is based on past and present execution-time behavior

Software Prefetching
— Ensure the overheads of prefetching do not outweigh the benefits -> conservative prefetching (a conventional example is sketched below)
— Adaptive software prefetching is required to change the prefetch distance during run time
— Hard to insert prefetches for irregular access patterns

Multithreading
— Solves the throughput problem, not the memory latency problem
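To make the software-prefetching row concrete, the sketch below shows conventional software prefetching in C with a fixed prefetch distance, using GCC's __builtin_prefetch. The dot-product kernel and the DIST value are illustrative assumptions, not taken from the HiDISC work; picking DIST well is exactly the tuning problem listed above, and the one HiDISC's slip control queue later automates.

#define DIST 16   /* illustrative prefetch distance, in elements */

/* Dot product with software prefetching: each iteration requests the
 * operands DIST iterations ahead so they are (hopefully) in the cache by
 * the time they are needed. Too small a DIST gives late prefetches; too
 * large a DIST evicts useful data or overflows the prefetch buffers.    */
double dot(const double *x, const double *h, int n)
{
    double y = 0.0;
    for (int j = 0; j < n; ++j) {
        if (j + DIST < n) {
            __builtin_prefetch(&x[j + DIST], 0, 1);  /* read, low temporal locality */
            __builtin_prefetch(&h[j + DIST], 0, 1);
        }
        y += x[j] * h[j];
    }
    return y;
}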

Page 5: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

The HiDISC Approach

Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach:
• Add a processor to manage prefetching -> hide overhead
• Compiler explicitly manages the memory hierarchy
• Prefetch distance adapts to the program's runtime behavior

Page 6: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing


What is HiDISC?

A dedicated processor for each level of the memory hierarchy

Explicitly manage each level of the memory hierarchy using instructions generated by the compiler

Hide memory latency by converting data access predictability to data access locality (Just in Time Fetch)

Exploit instruction-level parallelism without extensive scheduling hardware

Zero overhead prefetches for maximal computation throughput

[Block diagram: the compiler splits the program into Computation Instructions, Access Instructions, and Cache Management Instructions, executed by the Computation Processor (CP), Access Processor (AP), and Cache Management Processor (CMP), which are attached to the registers, the cache, and the 2nd-level cache and main memory, respectively.]

Page 7: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Decoupled Architectures

[Comparison diagram of four architectures, each built from registers, a cache, and a 2nd-level cache plus main memory. MIPS is a conventional single superscalar processor; DEAP and CAPP are earlier decoupled designs that split work between a Computation Processor (CP) and an Access Processor (AP) or a Cache Management Processor (CMP); HiDISC (new decoupled) uses all three: CP, AP, and CMP. The diagram annotates each processor with its issue width (2-, 3-, 5-, or 8-issue).]

DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WA

Page 8: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Slip Control Queue

The Slip Control Queue (SCQ) adapts dynamically:
• Late prefetches = prefetched data arrived after the load had been issued
• Useful prefetches = prefetched data arrived before the load had been issued

if (prefetch_buffer_full())
    Don't change size of SCQ;
else if ((2 * late_prefetches) > useful_prefetches)
    Increase size of SCQ;
else
    Decrease size of SCQ;
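A minimal C rendering of this adaptation policy is sketched below. The counter names, the stats structure, and the clamping bounds are hypothetical, introduced only for illustration; the slide does not specify how the counters are collected or by how much the SCQ grows or shrinks per step.

#include <stdbool.h>

#define SCQ_MIN_SIZE 1
#define SCQ_MAX_SIZE 128              /* illustrative bound */

/* Per-interval prefetch statistics (hypothetical bookkeeping). */
struct scq_stats {
    unsigned late_prefetches;         /* data arrived after the load issued  */
    unsigned useful_prefetches;       /* data arrived before the load issued */
    bool     prefetch_buffer_full;
};

/* Return the new SCQ size given the current size and the interval stats,
 * following the policy above: hold when the prefetch buffer is full, grow
 * when late prefetches dominate, otherwise shrink.                        */
static unsigned adapt_scq_size(unsigned scq_size, const struct scq_stats *s)
{
    if (s->prefetch_buffer_full)
        return scq_size;
    if (2u * s->late_prefetches > s->useful_prefetches)
        return scq_size < SCQ_MAX_SIZE ? scq_size + 1 : scq_size;
    return scq_size > SCQ_MIN_SIZE ? scq_size - 1 : scq_size;
}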

Page 9: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing


Decoupling Programs for HiDISC

(Discrete Convolution - Inner Loop)

Original inner loop:

for (j = 0; j < i; ++j)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

while (not EOD)
    y = y + (x * h);
send y to SDQ

Access Processor code:

for (j = 0; j < i; ++j) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token)
send address of y[i] to SAQ

Cache Management code:

for (j = 0; j < i; ++j) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data
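For readers who want to run something, here is a single-threaded C sketch that mimics the access/compute split for this inner loop, with a small software FIFO standing in for the hardware load data queue. In the real HiDISC the two streams execute concurrently on separate processors, and the queues, SCQ handshaking, and cache management are hardware mechanisms; everything named here (ldq_push, ldq_pop, the sizes) is an illustrative assumption.

#include <stdio.h>

#define N    16            /* illustrative problem size      */
#define QCAP (2 * N)       /* load data queue (LDQ) capacity */

/* Minimal FIFO standing in for the hardware load data queue. */
static double ldq[QCAP];
static int ldq_head, ldq_tail;
static void   ldq_push(double v) { ldq[ldq_tail++ % QCAP] = v; }
static double ldq_pop(void)      { return ldq[ldq_head++ % QCAP]; }

/* Access-stream role: generate the addresses and issue the loads. */
static void access_stream(const double *x, const double *h, int i)
{
    for (int j = 0; j < i; ++j) {
        ldq_push(x[j]);             /* load (x[j])     */
        ldq_push(h[i - j - 1]);     /* load (h[i-j-1]) */
    }
}

/* Computation-stream role: consume queued operands, no address math. */
static double compute_stream(int i)
{
    double y = 0.0;
    for (int j = 0; j < i; ++j) {
        double xv = ldq_pop();
        double hv = ldq_pop();
        y += xv * hv;               /* y = y + (x * h) */
    }
    return y;
}

int main(void)
{
    double x[N], h[N];
    for (int k = 0; k < N; ++k) { x[k] = k; h[k] = 1.0; }
    access_stream(x, h, N);         /* in HiDISC these two streams */
    double y = compute_stream(N);   /* would run concurrently      */
    printf("y = %f\n", y);
    return 0;
}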

Page 10: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Benchmarks

Benchmark | Source of Benchmark              | Lines of Source Code | Description                                 | Data Set Size
LLL1      | Livermore Loops [45]             | 20                   | 1024-element arrays, 100 iterations         | 24 KB
LLL2      | Livermore Loops                  | 24                   | 1024-element arrays, 100 iterations         | 16 KB
LLL3      | Livermore Loops                  | 18                   | 1024-element arrays, 100 iterations         | 16 KB
LLL4      | Livermore Loops                  | 25                   | 1024-element arrays, 100 iterations         | 16 KB
LLL5      | Livermore Loops                  | 17                   | 1024-element arrays, 100 iterations         | 24 KB
Tomcatv   | SPECfp95 [68]                    | 190                  | 33x33-element matrices, 5 iterations        | <64 KB
MXM       | NAS kernels [5]                  | 113                  | Unrolled matrix multiply, 2 iterations      | 448 KB
CHOLSKY   | NAS kernels                      | 156                  | Cholesky matrix decomposition               | 724 KB
VPENTA    | NAS kernels                      | 199                  | Invert three pentadiagonals simultaneously  | 128 KB
Qsort     | Quicksort sorting algorithm [14] | 58                   | Quicksort                                   | 128 KB

Page 11: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Simulation Parameters

Parameter              | Value                    | Parameter                | Value
L1 cache size          | 4 KB                     | L2 cache size            | 16 KB
L1 cache associativity | 2                        | L2 cache associativity   | 2
L1 cache block size    | 32 B                     | L2 cache block size      | 32 B
Memory latency         | Variable (0-200 cycles)  | Memory contention time   | Variable
Victim cache size      | 32 entries               | Prefetch buffer size     | 8 entries
Load queue size        | 128                      | Store address queue size | 128
Store data queue size  | 128                      | Total issue width        | 8

Page 12: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Simulation Results

[Four plots of performance versus main memory latency (0 to 200 cycles) for LLL3, Tomcatv, Cholsky, and Vpenta, each comparing MIPS, DEAP, CAPP, and HiDISC.]

Page 13: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing


Accomplishments

2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor

7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor - (similar operations are used in ATR/SLD)

2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor

Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)

Allows the compiler to solve indexing functions for irregular applications

Reduced system cost for high-throughput scientific codes

Page 14: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Work in Progress

• Compiler design
• Data Intensive Systems (DIS) benchmarks analysis
• Simulator update
• Parameterization of silicon space for VLSI implementation

Page 15: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Compiler Requirements

• Source language flexibility
• Sequential assembly code for streaming
• Ease of implementation
• Optimality of sequential code
• Portability and upgradability

Page 16: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Gcc-2.95 Features

• Localized register spilling and global common subexpression elimination using lazy code motion algorithms
• Enhanced control flow graph analysis: the new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
+ Provision to add modules for instruction scheduling and delayed branch execution
+ Front-ends for C, C++, and Fortran available
+ Support for different environments and platforms
+ Cross compilation

Page 17: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Compiler Organization

HiDISC Compilation Overview

[Flow: Source Program -> GCC -> Assembly Code -> Stream Separator -> Computation Assembly Code, Access Assembly Code, and Cache Management Assembly Code -> one Assembler per stream -> Computation, Access, and Cache Management Object Code]

Page 18: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

HiDISC Stream Separator

[Flow chart, top to bottom:]
• Sequential source -> program flow graph
• Classify address registers
• Allocate instructions to streams (access stream / computation stream); a simplified allocation rule is sketched after this list
• Fix conditional statements
• Move queue accesses into instructions
• Move loop invariants out of the loop
• Add Slip Control Queue instructions
• Substitute prefetches for loads, remove global stores, and reverse SCQ direction
• Add global data communication and synchronization
• Produce assembly code: Computation Assembly Code, Access Assembly Code, and Cache Management Assembly Code
(The diagram distinguishes the steps that are current work from those that are future work.)
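The sketch below illustrates only the "allocate instructions to streams" step in the simplest possible form, on a made-up instruction representation: memory operations and address arithmetic go to the access stream, other ALU work goes to the computation stream, and branches are duplicated so both streams keep the same control flow. The enum names, instruction model, and allocation rule are assumptions for illustration; this is not the actual stream separator.

#include <stdio.h>

/* Simplified instruction model, for illustration only. */
enum opclass { OP_LOAD, OP_STORE, OP_ADDR_ARITH, OP_ALU, OP_BRANCH };
enum stream  { STREAM_ACCESS, STREAM_COMPUTE, STREAM_BOTH };

struct insn {
    enum opclass cls;
    const char  *text;
};

/* Allocate one instruction to a stream. */
static enum stream allocate(const struct insn *in)
{
    switch (in->cls) {
    case OP_LOAD:
    case OP_STORE:
    case OP_ADDR_ARITH:
        return STREAM_ACCESS;     /* memory ops and address math       */
    case OP_BRANCH:
        return STREAM_BOTH;       /* control flow kept in both streams */
    default:
        return STREAM_COMPUTE;    /* everything else is computation    */
    }
}

int main(void)
{
    const struct insn body[] = {
        { OP_ADDR_ARITH, "addr = &x[j]"      },
        { OP_LOAD,       "r1 = load addr"    },
        { OP_ALU,        "y = y + r1 * r2"   },
        { OP_BRANCH,     "if (j < i) goto L" },
    };
    static const char *names[] = { "ACCESS", "COMPUTE", "BOTH" };

    for (unsigned k = 0; k < sizeof body / sizeof body[0]; ++k)
        printf("%-7s  %s\n", names[allocate(&body[k])], body[k].text);
    return 0;
}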

Page 19: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing


Compiler Front End Optimizations

Jump Optimization: simplify jumps to the following instruction, jumps across jumps and jumps to jumps

Jump Threading: detect a conditional jump that branches to an identical or inverse test

Delayed Branch Execution: find instructions that can go into the delay slots of other instructions

Constant Propagation: Propagate constants into a conditional loop
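As a source-level illustration of the constant-propagation case (gcc actually performs this on its RTL intermediate form, so the C below is only a stand-in), propagating a known constant into the loop condition exposes the trip count to later passes:

/* Before: the bound is a local variable even though its value is known. */
int sum_before(const int *a)
{
    int limit = 8, sum = 0;
    for (int i = 0; i < limit; ++i)
        sum += a[i];
    return sum;
}

/* After constant propagation: the literal bound lets later passes
 * (unrolling, strength reduction) see the trip count directly.      */
int sum_after(const int *a)
{
    int sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += a[i];
    return sum;
}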

Page 20: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Compiler Front End Optimizations (contd.)

Instruction Combination: combine groups of two or three instructions that are related by data flow into a single instruction

Instruction Scheduling: looks for instructions whose output will not be available by the time it is used in subsequent instructions

Loop Optimizations: move constant expressions out of loops, and do strength reduction
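A source-level illustration of the loop optimizations named above, again only a stand-in for what happens on gcc's intermediate representation: the invariant expression is hoisted out of the loop and the multiplication in the index is strength-reduced to a pointer increment.

/* Before: (base + offset) is recomputed and i * stride is multiplied
 * on every iteration.                                                */
void scale_before(float *a, int n, int stride, float base, float offset)
{
    for (int i = 0; i < n; ++i)
        a[i * stride] = (base + offset) * a[i * stride];
}

/* After loop-invariant code motion and strength reduction. */
void scale_after(float *a, int n, int stride, float base, float offset)
{
    float  k = base + offset;                 /* hoisted invariant  */
    float *p = a;                             /* induction variable */
    for (int i = 0; i < n; ++i, p += stride)  /* multiply -> add    */
        *p = k * *p;
}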

Page 21: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Example of Stressmarks

Pointer Stressmark
• Basic idea: repeatedly follow pointers to randomized locations in memory
• Memory access pattern is unpredictable
• Randomized memory access pattern:
  – Not sufficient temporal and spatial locality for conventional cache architectures
• HiDISC architecture provides lower memory access latency
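A generic pointer-chasing kernel in the spirit of this stressmark is sketched below; it is not the DIS Pointer Stressmark code itself, and the array layout and the token work per hop are assumptions. Each load determines the address of the next one, so neither a hardware prefetcher nor a cache can get ahead of the chain, which is the access pattern the bullets above describe.

#include <stddef.h>

/* Follow a chain of indices through 'field' for 'hops' steps. 'field' is
 * assumed to hold a randomized permutation of 0..len-1, so every access
 * lands at an effectively random location with little spatial or temporal
 * locality, and each load depends on the previous one.                    */
unsigned long pointer_chase(const size_t *field, size_t start, unsigned long hops)
{
    size_t index = start;
    unsigned long balance = 0;

    while (hops--) {
        index = field[index];      /* next "pointer" comes from this load */
        balance += index & 1;      /* token work on the visited element   */
    }
    return balance;
}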

Page 22: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Decoupling of Pointer Stressmarks

Inner loop for the next indexing:

for (i = j+1; i < w; i++) {
    if (field[index+i] > partition) balance++;
}
if (balance + high == w/2) break;
else if (balance + high > w/2) {
    min = partition;
} else {
    max = partition;
    high++;
}

Computation Processor code:

while (not EOD)
    if (field > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) {
    min = partition;
} else {
    max = partition;
    high++;
}

Access Processor code:

for (i = j+1; i < w; i++) {
    load (field[index+i]);
    GET_SCQ;
}
send (EOD token)

Cache Management code:

for (i = j+1; i < w; i++) {
    prefetch (field[index+i]);
    PUT_SCQ;
}

Page 23: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Stressmarks

Hand-compile the 7 individual benchmarks
• Use gcc as the front-end
• Manually partition each of the three instruction streams and insert synchronizing instructions

Evaluate architectural trade-offs
• Updated simulator characteristics such as out-of-order issue
• Large L2 cache and an enhanced main memory system such as RAMBUS and DDR

Page 24: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Simulator Update

Survey current processor architectures
• Focus on commercial leading-edge technology for implementation

Analyze the current simulator and previous benchmark results
• Enhance memory hierarchy configurations
• Add out-of-order issue

Page 25: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Memory Hierarchy

Modern processors have increasingly large on-chip L2 caches
• E.g., 256 KB L2 cache on Pentium and Athlon processors
• Reduces the L1 cache miss penalty

New mechanisms in main memory architecture (e.g., RAMBUS) likewise reduce the L2 cache miss penalty

Page 26: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Out-of-Order Multiple Issue

Most current advanced processors are based on the superscalar, multiple-issue paradigm
• MIPS R10000, PowerPC, UltraSPARC, Alpha, and the Pentium family

Compare the HiDISC architecture with modern superscalar processors
• Out-of-order instruction issue
• For precise exception handling, include in-order completion

New access decoupling paradigm for out-of-order issue

Page 27: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

HiDISC with Modern DRAM Architecture

RAMBUS and DDR DRAM improve memory bandwidth
• Latency does not improve significantly

The decoupled access processor can fully utilize the enhanced memory bandwidth
• More requests are generated by the access processor
• The prefetching mechanism hides memory access latency

Page 28: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

HiDISC / SMT

The reduced memory latency of HiDISC can
• decrease the number of threads needed by an SMT architecture
• relieve the memory burden of an SMT architecture
• lessen the complex issue logic of multithreading

Functional unit utilization can increase with multithreading features on HiDISC
• More instruction-level parallelism is possible

Page 29: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

The McDISC System: Memory-Centered Distributed Instruction Set Computer

[System diagram: each McDISC node extends HiDISC with a Disc Processor (DP) and disc cache in front of a RAID disc farm; the compiler still splits the program into computation, access, and cache management instructions for the Computation Processor (CP), Access Processor (AP), and Cache Management Processor (CMP), which sit beside the registers, cache, and main memory. Adaptive Signal PIM (ASP), Adaptive Graphics PIM (AGP), and Network Interface Processor (NIP) modules attach to the memory system, and nodes are connected through register links to CP neighbors in a 3-D torus of pipelined rings (X, Y, Z). Sensor inputs and sensor data (FLIR, SAR, VIDEO, ESS, SES) feed a dynamic database; outputs support situation awareness, understanding, inference analysis, decision process/targeting, and network management, and go to displays and the network.]

Page 30: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Summary

• Designing a compiler
  – Porting gcc to HiDISC
• Benchmark simulation with new parameters and the updated simulator
• Analysis of architectural trade-offs for equal silicon area
• Hand-compilation of the Stressmark suite and simulation
• DIS benchmark simulation