TRANSCRIPT
Compilation for Scalable, Paged Virtual Hardware
Eylon Caspi
Qualifying Exam
3/6/01
University of California, Berkeley
The Compilation Problem
Programming Model: communicating EFSM operators (unrestricted size, # IOs, timing)
Execution Model: communicating page configs (fixed size, # IOs, timing); paged virtual hardware
[Diagram: TDF operators, streams, and memory segments compile to compute pages, streams, and memory segments]
Compilation is a resource-binding transform on state machines + data-paths
3/6/01 Eylon Caspi – Qualifying Exam 3
Overview
Motivation: paged virtual hardware (software survival + scalability); SCORE programming model
Compilation methodology: new page partitioning techniques; automatic synthesis & partitioning of communicating FSMs
Evaluation + architectural studies
Timeline
Reconfigurable Computing
Programmable logic + programmable interconnect (e.g. FPGA)
10x-100x gain vs. microprocessors in: performance; functional density (work per area-time)
Spatial computing: parallelism; custom data paths
Programmability: custom execution sequence; specialization
BUT current models expose resource constraints to the programmer: the programmer has to target a specific device, which limits software longevity
Graphics copyright by their respective companies
Solution: Virtual Hardware
Compute model with unbounded resources: the programmer no longer targets a specific device
Enables software longevity, scalability
Requires efficient hardware virtualization:
Large device: concurrent spatial execution
Small device: time multiplexing (paging model)
Previous Approaches to Paging
WASMII: register IO [Ling+Amano, FCCM ‘93]: page IO via registers; evaluate each page for a cycle, then reconfigure. Reconfiguration time dominates execution.
DPGA: configuration cache [DeHon, FPGA ‘94]; TM-FPGA [Xilinx, FCCM ‘97]: fast reconfiguration at a cost in area and power. Reconfiguration power dominates execution.
PipeRench: stripes [CMU, FPGA ‘98]: pipelined reconfiguration. Feed-forward computation only.
Paging + Streaming
Streaming allows efficient, useful virtualization: amortizes reconfiguration cost over a larger epoch; exploits program structure; less restrictive communication topology
A joint responsibility of the compiler and scheduler
SCORE Compute Model
Program = DFG of compute nodes (Kahn process network: blocking read, non-blocking write)
Compute: SFSM (Streaming Finite State Machine). Concretely: a page + FSM implementing token-flow semantics. Abstractly: a task with local control.
Communication: stream, an abstraction of a wire, with buffering
Storage: memory segment
Dynamics: dynamic local behavior in an SFSM; unbounded resource usage (stream buffer expansion); dynamic graph allocation in an STM (Streaming Turing Machine)
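The Kahn semantics above (blocking read, non-blocking write) can be sketched in a few lines of Python using threads; `Stream` and `doubler` are illustrative names, not part of the SCORE API.

```python
import queue
import threading

class Stream:
    """A Kahn-style stream: unbounded FIFO, so writes never block."""
    def __init__(self):
        self._q = queue.Queue()       # unbounded queue
    def read(self):
        return self._q.get()          # blocking read
    def write(self, token):
        self._q.put(token)            # non-blocking write

def doubler(inp, out, n):
    # A trivial SFSM-like process: consume n tokens, emit 2x each.
    for _ in range(n):
        out.write(2 * inp.read())

a, b = Stream(), Stream()
t = threading.Thread(target=doubler, args=(a, b, 3))
t.start()
for x in (1, 2, 3):
    a.write(x)
t.join()
results = [b.read() for _ in range(3)]   # [2, 4, 6]
```

Because reads block and writes do not, the network's output is deterministic regardless of thread scheduling, which is the key Kahn property SCORE relies on.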
SCORE Programming Model: TDF
TDF = intermediate, behavioral language for EFSM operators and static operator graphs
State machine provides: firing signatures; control flow (branching)
Firing semantics: when in state X, wait for X’s inputs, then fire (consume, act)
select (input boolean      s,
        input unsigned[8]  t,
        input unsigned[8]  f,
        output unsigned[8] o)
{
  state S (s): if (s) goto T; else goto F;
  state T (t): o = t; goto S;
  state F (f): o = f; goto S;
}
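The firing semantics of this `select` operator can be mimicked in Python: each state reads only the inputs in its signature. The function `run_select` is an illustrative sketch, not generated compiler output.

```python
def run_select(s_tokens, t_tokens, f_tokens):
    """Emulate TDF select: in state S read s; in T read t; in F read f."""
    s, t, f = iter(s_tokens), iter(t_tokens), iter(f_tokens)
    out = []
    state = 'S'
    try:
        while True:
            if state == 'S':
                state = 'T' if next(s) else 'F'    # S consumes only s
            elif state == 'T':
                out.append(next(t)); state = 'S'   # T consumes only t
            else:
                out.append(next(f)); state = 'S'   # F consumes only f
    except StopIteration:
        return out

out = run_select([True, False, True], [10, 20], [7])   # [10, 7, 20]
```

Note that `select` never reads both `t` and `f` in one firing; which stream is consumed is under local FSM control, which is exactly what a firing signature expresses.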
SCORE Hardware Model
Paged FPGA
Compute Page (CP): fixed-size slice of RC hardware; fixed number of I/O ports
Configurable Memory Block (CMB): distributed, on-chip memory; stream access
High-level interconnect
Microprocessor: run-time support + user code
SCORE Software Infrastructure
Device Simulator: cycle-accurate behavioral simulation; parameterized (e.g. # pages); interacts with concurrent user processes (STMs) via stream API
Page Scheduler: version 1: dynamic, list-based scheduling (by input availability); version 2: static, precedence-based
TDF Compiler: compiles to working C++ simulation code; no partitioning (page = 1 TDF operator)
Applications: Wavelet, JPEG, MPEG, IIR
[Plot: runtime vs. device size]
Communication is King
With virtualization, inter-page delay is unknown and sensitive to: placement; interconnect implementation; page schedule; technology (wire delay is growing)
Inter-page feedback is SLOW: partition to contain feedback loops within a page; schedule to contain feedback loops on the device
Structural Partitioning is Not Enough
Structural partitioning does not address feedback loops:
Wire min-cut: FM, flow-based
Minimum wire length: spectral
Delay-optimal DAG mapping: DAGON, FlowMap, Wong
Structural partitioning does not address communication rates, dynamics: all loops are NOT created equal
FSM Decomposition is Not Enough
Ashar+Devadas+Newton (ICCAD ‘89): minimize logic
Kuo+Liu+Cheng (ISCAS ‘95): minimize wires
Benini+DeMicheli+Vermeulen (ISCAS ‘98): minimize power
None consider inter-page delay; none consider cutting / scheduling the data-path separately from the FSM
Outline
Motivation
Compilation Methodology
Evaluation + Architectural Studies
Timeline
Compilation – Scope
Synthesis + partitioning of SFSMs: TDF → pages (resource binding)
Target: parameterized hardware model / simulation
Constrained optimization problem; constraints: page area, IO, timing
Optimality criteria: primary: communication delay; secondary: communication bandwidth, area
Compilation Flow Overview
(1) Optimizations; (2) data-path timing + scheduling; (3) partitioning
Ignored: place / route / retime within a page (known solutions in the community)
Page scheduling: the responsibility of a separate scheduler
Synthesis + Partitioning Flow
Pipeline Extraction
Data Path Mapping
Partition Large States
Schedule DF into States
Cluster States
Page Packing
Synthesize Page FSMs
Compiler Optimizations
(Phases: optimization, preliminary code, data-path, partitioning)
How Big is an Operator?
[Chart: area (4-LUTs) for 47 operators before pipeline extraction, sorted by area, split into FSM area and DF area; applications: Wavelet Encode/Decode, JPEG Encode/Decode, MPEG Encode (I/P), IIR]
Partitioning Tasks
(1) Decompose / shrink SFSMs
(2) Pack SFSMs onto pages
[Flow diagram repeated from “Synthesis + Partitioning Flow”]
Pipeline Extraction
Hoist uncontrolled feed-forward data-flow out of the FSMD
Benefits: shrinks the FSM’s cyclic core; the extracted pipeline has more freedom for scheduling and partitioning
Example: the comparison x==0 moves into an extracted pipeline, so
  state foo(x): if (x==0) ...
becomes
  state foo(xz): if (xz) ...
where the new input xz carries the pre-computed result of x==0
Pipeline Extraction – Extractable Area
[Chart: extractable data-path area (4-LUTs) for 47 operators, sorted by data-path area, split into extracted DF area and residual DF area; applications: JPEG Encode/Decode, MPEG (I/P), Wavelet Encode, IIR]
Pipeline Extraction – Residual SFSM
[Chart: area (4-LUTs) for 47 operators after pipeline extraction, sorted by area, split into FSM area and residual DF area; applications: JPEG Encode/Decode, MPEG (I/P), Wavelet Encode, IIR]
Data-path Mapping / Scheduling
Tasks: bind technology-specific area/time to data-path primitives; schedule data-path primitives in the state machine
Fixed-frequency target: decompose primitives into multi-cycle operations; data-path module library / tree matching
Pipeline linearized sequences / loops; DAG mapping of state logic is insufficient
Compiler technology: code motion; software pipelining
Delay-Oriented State Clustering
Indivisible unit: state (CF + DF); spatial locality in state logic
Cluster states into page-size sub-machines; inter-page communication for data flow, state flow
Sequential delay is in inter-page state transfer: cluster to maintain local control; cluster to contain state loops
Similar to: VLIW trace scheduling [Fisher ‘81]; FSM decomposition for low power [Benini+DeMicheli, ISCAS ‘98]; VM/cache code placement; GarpCC HW/SW partitioning [Callahan ‘00]
State Clustering Formulation
Min-cut transition probabilities in the state flow graph (probabilities from profiling)
Area-constrained: balanced min-cut partitioning [Yang+Wong, ACM ‘94]; iterate to the desired partition area: (1−ε)A ≤ a(X) ≤ (1+ε)A
IO-constrained: add wire edges; mix edge weights: w = c·w_wire + (1−c)·w_SF; use the smallest IO-feasible c
Requires all states to be smaller than a page
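The clustering objective can be illustrated with a toy brute-force version: balanced min-cut over a tiny state-flow graph with mixed edge weights w = c·w_wire + (1−c)·w_SF. Real partitioners (FM, flow-based) replace the brute force; all names and numbers here are invented for illustration.

```python
from itertools import combinations

def min_cut(nodes, edges, area, A, eps, c):
    """Brute-force balanced min-cut with mixed edge weights."""
    best = None
    for k in range(1, len(nodes)):
        for part in combinations(nodes, k):
            X = set(part)
            aX = sum(area[v] for v in X)
            if not ((1 - eps) * A <= aX <= (1 + eps) * A):
                continue                     # area-balance constraint
            cut = sum(c * ww + (1 - c) * ws  # mixed edge weight
                      for (u, v, ww, ws) in edges
                      if (u in X) != (v in X))
            if best is None or cut < best[0]:
                best = (cut, X)
    return best

nodes = ['p1', 'p2', 'p3', 'p4']
area = {'p1': 2, 'p2': 2, 'p3': 2, 'p4': 2}
# edges: (u, v, wire weight, state-flow transition probability)
edges = [('p1', 'p2', 1, 0.9), ('p2', 'p3', 1, 0.1),
         ('p3', 'p4', 1, 0.8), ('p4', 'p1', 1, 0.2)]
cut, X = min_cut(nodes, edges, area, A=4, eps=0.0, c=0.0)
# With c=0 only state-flow probabilities count: the cheapest balanced
# cut severs the rare transitions p2-p3 and p4-p1.
```

Sweeping c upward trades state-flow locality for fewer cut wires; the slide's recipe is to pick the smallest c whose partition meets the page IO limit.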
Page Packing
Cluster SFSMs + pipelines; avoid page fragmentation
Min-cut streams of the top-level DFG; allow cutting pipelines, not SFSMs; area- and IO-constrained (Wong balanced min-cut partitioning)
Disallow certain topologies: no dynamic-rate streams within a page; data-flow feedback?
Outline
Motivation
Compilation Methodology
Evaluation + Architectural Studies
Timeline
Evaluating Paging Overhead
Applications must be rewritten in TDF. Existing: Wavelet, JPEG, MPEG, IIR. To do: ADPCM, BABAR particle detector
Metrics: circuit area (# pages × page size); page delay (LUT depth per firing); performance (total run time, “makespan”)
Baseline comparison: “unpartitioned” (page = 1 TDF operator), i.e. ideal virtualization with zero partitioning cost; cannot do better
Page Size Studies
Paging overhead varies with: application; page size, IO; the match thereof
Is paging overhead robust to a mismatch? Vary page parameters and measure:
(1) pure area overhead
(2) pure performance overhead: execute spatially in expanded hardware
(3) virtualized performance overhead: execute in a fixed device size
Outline
Motivation
Compilation Methodology
Evaluation + Architectural Studies
Timeline
Status
SCORE compiler / simulator / scheduler: compile + execute unpartitioned (page = 1 TDF operator)
Preliminary synthesis + partitioning work: pipeline extraction; FSM synthesis to SIS; area-constrained state clustering
To do: complete the initial implementation; evaluate; improve (secondary implementation)
To Complete Initial Implementation
IO-constrained state clustering
Decompose large states
Page packing
Data-path scheduling in states
Synthesize partitioned SFSMs
Secondary Implementation – Possibilities
Optimizations: SW pipelining; use SUIF
State clustering with replication
Unified state clustering + page packing: cluster states of all operators simultaneously
Finer-grained clustering: recast as BDF, min-cut stream rates
Time Line
Months 3–12 of 2001 through months 1–8 of 2002:
Impl. 1 → Eval → Impl. 2 → Eval → Thesis writing
Summary
Partitioning and paging enable: software survival / scaling; efficient use of small HW for dynamic apps
My contributions:
Methodology for page synthesis + partitioning: necessary for efficient virtualization
Evaluation framework: verify that paging can be efficient
Architectural studies
Supplemental Material
SFSMs + transforms
SCORE simulation + scaling results
Page hardware model
Synthesis observations
Architectural studies
TDF Dataflow Process Network
Dataflow Process Network [Parks+Lee, IEEE May ‘95]: a process is enabled by a set of firing rules R = {R1, R2, …, RN}
Each firing rule is a set of patterns: Ri = {Ri,1, Ri,2, …, Ri,p}
DF process for a TDF operator: feedback arc for state; one firing rule per state
Patterns match the state value + presence of the desired inputs, e.g. for state i: Ri = {Ri,1, Ri,2, …, [i]}
Patterns: Ri,j = [*] if input j is in state i’s input signature; Ri,j = ⊥ if input j is not in state i’s input signature; Ri,p = [i] for the final input, representing the state arc
These are sequential firing rules; a partitioned SFSM adds a “wait” state
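The firing-rule check can be sketched concretely: one rule per state, where a rule's signature lists the inputs that must have a token present (the [*] patterns; inputs outside the signature are ignored). `enabled` and the rule table are illustrative, not SCORE code.

```python
def enabled(rules, state, inputs):
    """True if the firing rule for `state` matches: every input in the
    state's signature has at least one token queued."""
    return all(len(inputs[name]) > 0 for name in rules[state])

# select's firing rules: state S needs s; states T, F need t, f
rules = {'S': {'s'}, 'T': {'t'}, 'F': {'f'}}
inputs = {'s': [True], 't': [], 'f': [7]}   # token queues per input

can_fire_S = enabled(rules, 'S', inputs)    # s present: S can fire
can_fire_T = enabled(rules, 'T', inputs)    # t absent:  T is blocked
```

Because the current state selects exactly one rule, at most one rule can match at a time, which is why these rules are sequential and the process remains deterministic.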
SFSM Partitioning Transform
Only 1 partition is active at a time; transform to activate via streams
New state in each partition: “wait”: used when not active; waits for activation from other partition(s); has one input signature (firing rule) per activator
Firing rules are not sequential, but determinism is guaranteed: only 1 possible activator
Activation streams from a given source to a given destination partition can be merged + binary-encoded
Distributing/Collecting Shared Streams
Requires inter-page synchronization for ordering
Two schemes for input distribution:
(1) send the token to all pages: inactive pages must discard tokens, and must know how many to discard
(2) send the token only to the active page: the distributor must know the state; (a) the present state requests the token, OR (b) the previous state pre-fetches the token
One scheme for output collection: the collector must know the state
How to cluster distributors / collectors? Distributor scheme (1) and the collector incur no sequential delay (wire min-cut ok). Distributor scheme (2)(a) can be cast into delay-optimal state clustering:
– Decompose reading states into sequences of single-read states
– Pre-cluster states that read the same stream: this forms distributors
– The sequential delay of a read request is now modeled as a state transfer to the distributor
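Input-distribution scheme (2) can be sketched as a distributor that tracks the FSM state and forwards each shared-stream token only to the page owning the current state; `distribute` and the page names are hypothetical illustrations.

```python
def distribute(tokens, state_seq, owner_of_state):
    """Forward tokens[i] to the page that owns the state active when
    token i is read (scheme 2: distributor must know the state)."""
    delivered = {}
    for tok, st in zip(tokens, state_seq):
        page = owner_of_state[st]          # distributor tracks FSM state
        delivered.setdefault(page, []).append(tok)
    return delivered

# States A, B live on page0; C, D on page1 (illustrative partition)
owner = {'A': 'page0', 'B': 'page0', 'C': 'page1', 'D': 'page1'}
out = distribute([1, 2, 3, 4], ['A', 'C', 'C', 'B'], owner)
# page0 receives the tokens read in states A and B; page1 those in C
```

Contrast with scheme (1), where every page would see all four tokens and inactive pages would have to count and discard the ones not meant for them.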
Decomposing Large States
A state may be larger than a page
Decomposing into a sequence of page-size states leads to excessive inter-page transfer
Better: delay-optimal DAG-mapping into parallel pages
SFSM Optimizations
Many traditional compiler optimization techniques apply to TDF: state flow ~ basic block flow, but with a different cost model (“unlimited” registers and functional units)
E.g. work-reducing optimizations: constant folding / propagation; common subexpression elimination; hoisting loop invariants; strength reduction
SCORE Functional Simulation
FPGA based on HSRA [Berkeley, FPGA ’99]: CP: 512 4-LUTs; CMB: 2 Mbit DRAM
Area for a CP-CMB pair: 0.25µm: 12.9 mm² (1/9 of a PII-450); 0.18µm: 6.7 mm² (1/16 of a PIII-600)
Page reconfiguration: 5000 cycles (from CMB); synchronous operation (same clock speed as the processor)
x86 microprocessor
Page Scheduler task: swap on timer interrupt (every 250,000 cycles); fully dynamic scheduling
Application: JPEG Encode
Scaling Results: JPEG Encode
[Plot: total time (makespan, in millions of cycles) vs. number of physical compute pages]
Page Hardware Model
Page = fixed-size slice of resources + stream interface
FSM for: firing; output emission; data-path control; branching
[Diagram: reconfigurable logic vs. fixed logic within a page]
Page Firing Logic
Sample firing logic: 3 inputs (A, B, C); 3 outputs (X, Y, Z); single signature
How Large is a State?
[Histogram: data-path area per state (4-LUTs) vs. count, for 1404 states from 5 applications: JPEG Encode/Decode, MPEG (I/P), IIR]
SFSM Firing Delay
A complex SFSM may require ≥1 cycle just for control: evaluate the firing rule, generate control signals, compute the next state
Should we partition the SFSM to minimize FSM logic? No: incurring inter-page communication latency is worse!
[Histogram: FSM delay (4-LUT depth) vs. count, for 47 unpartitioned operators; applications: JPEG Encode/Decode, MPEG (I/P), Wavelet Encode, IIR]
[Histogram: number of FSM inputs vs. count, for the same 47 unpartitioned operators]
Scaling the Hardware Resources
A simplified scaling model for architectural studies
Scaling page size (LUTs) induces scaling of other resources, e.g.:
Scaling memory: constant CP-to-CMB ratio
Scaling page IO: Rent’s Rule: IO = C·A^p (0 ≤ p ≤ 1)
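Rent's Rule makes the IO scaling concrete: doubling page area multiplies IO by 2^p, not 2. The constants C and p below are arbitrary example values, not SCORE parameters.

```python
def page_io(area_luts, C=3.0, p=0.5):
    """Rent's Rule estimate of page IO: IO = C * A**p, with 0 <= p <= 1."""
    return C * area_luts ** p

# Doubling page area from 512 to 1024 LUTs scales IO by 2**p (~1.41
# for p = 0.5), i.e. sub-linearly whenever p < 1.
ratio = page_io(1024) / page_io(512)
```

This sub-linear growth is why larger pages need proportionally fewer IO ports per LUT, one of the trade-offs the page-size studies above sweep.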