TRANSCRIPT
Compilation for Scalable, Paged Virtual Hardware
Eylon Caspi
Qualifying Exam
3/6/01
University of California, Berkeley
The Compilation Problem
Programming Model: communicating EFSM operators (unrestricted size, # IOs, timing)
Execution Model: communicating page configs (fixed size, # IOs, timing); paged virtual hardware
[Diagram: TDF operators, streams, and memory segments compile to compute pages, streams, and memory segments]
Compilation is a resource-binding transform on state machines + data-paths
3/6/01 Eylon Caspi – Qualifying Exam 3
Overview
Motivation: paged virtual hardware (software survival + scalability); SCORE programming model
Compilation methodology: new page partitioning techniques; automatic synthesis & partitioning of communicating FSMs
Evaluation + architectural studies
Timeline
Reconfigurable Computing
Programmable logic + programmable interconnect (e.g. FPGA)
10x-100x gain vs. microprocessors in: performance; functional density (work per area-time)
Spatial computing: parallelism; custom data paths
Programmability: custom execution sequence; specialization
BUT current models expose resource constraints to the programmer: the programmer has to target a specific device, which limits software longevity
Graphics copyright by their respective companies
Solution: Virtual Hardware
Compute model with unbounded resources: the programmer no longer targets a specific device
Enables software longevity, scalability
Requires efficient hardware virtualization:
Large device: concurrent spatial execution
Small device: time multiplexing (paging model)
Previous Approaches to Paging
WASMII: register IO [Ling+Amano, FCCM ‘93]: page IO via registers; evaluate each page for a cycle, then reconfigure. Reconfiguration time dominates execution.
DPGA: configuration cache [DeHon, FPGA ‘94]; TM-FPGA [Xilinx, FCCM ‘97]: fast reconfiguration at a cost in area and power. Reconfiguration power dominates execution.
PipeRench: stripes [CMU, FPGA ‘98]: pipelined reconfiguration. Feed-forward computation only.
Paging + Streaming
Streaming allows efficient, useful virtualization: amortizes reconfiguration cost over a larger epoch; exploits program structure; less restrictive communication topology
A joint responsibility of the compiler and scheduler
SCORE Compute Model
Program = DFG of compute nodes (Kahn process network: blocking read, non-blocking write)
Compute: SFSM (Streaming Finite State Machine). Concretely: a page + FSM implementing token-flow semantics. Abstractly: a task with local control.
Communication: stream, an abstraction of a wire, with buffering
Storage: memory segment
Dynamics: dynamic local behavior in an SFSM; unbounded resource usage (stream buffer expansion); dynamic graph allocation in an STM (Streaming Turing Machine)
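The Kahn semantics above (blocking read, non-blocking write) can be sketched in a few lines of Python using threads; `Stream` and `doubler` are illustrative names, not part of the SCORE API.

```python
import queue
import threading

class Stream:
    """A Kahn-style stream: unbounded FIFO, so writes never block."""
    def __init__(self):
        self._q = queue.Queue()       # unbounded queue
    def read(self):
        return self._q.get()          # blocking read
    def write(self, token):
        self._q.put(token)            # non-blocking write

def doubler(inp, out, n):
    # A trivial SFSM-like process: consume n tokens, emit 2x each.
    for _ in range(n):
        out.write(2 * inp.read())

a, b = Stream(), Stream()
t = threading.Thread(target=doubler, args=(a, b, 3))
t.start()
for x in (1, 2, 3):
    a.write(x)
t.join()
results = [b.read() for _ in range(3)]   # [2, 4, 6]
```

Because reads block and writes do not, the network's output is deterministic regardless of thread scheduling, which is the key Kahn property SCORE relies on.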
SCORE Programming Model: TDF
TDF = intermediate, behavioral language for EFSM operators and static operator graphs
State machine provides: firing signatures; control flow (branching)
Firing semantics: when in state X, wait for X’s inputs, then fire (consume, act)
select (input boolean      s,
        input unsigned[8]  t,
        input unsigned[8]  f,
        output unsigned[8] o)
{
  state S (s): if (s) goto T; else goto F;
  state T (t): o = t; goto S;
  state F (f): o = f; goto S;
}
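The firing semantics of this `select` operator can be mimicked in Python: each state reads only the inputs in its signature. The function `run_select` is an illustrative sketch, not generated compiler output.

```python
def run_select(s_tokens, t_tokens, f_tokens):
    """Emulate TDF select: in state S read s; in T read t; in F read f."""
    s, t, f = iter(s_tokens), iter(t_tokens), iter(f_tokens)
    out = []
    state = 'S'
    try:
        while True:
            if state == 'S':
                state = 'T' if next(s) else 'F'    # S consumes only s
            elif state == 'T':
                out.append(next(t)); state = 'S'   # T consumes only t
            else:
                out.append(next(f)); state = 'S'   # F consumes only f
    except StopIteration:
        return out

out = run_select([True, False, True], [10, 20], [7])   # [10, 7, 20]
```

Note that `select` never reads both `t` and `f` in one firing; which stream is consumed is under local FSM control, which is exactly what a firing signature expresses.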
SCORE Hardware Model
Paged FPGA
Compute Page (CP): fixed-size slice of RC hardware; fixed number of I/O ports
Configurable Memory Block (CMB): distributed, on-chip memory; stream access
High-level interconnect
Microprocessor: run-time support + user code
SCORE Software Infrastructure
Device Simulator: cycle-accurate behavioral simulation; parameterized (e.g. # pages); interacts with concurrent user processes (STMs) via stream API
Page Scheduler: version 1: dynamic, list-based scheduling (by input availability); version 2: static, precedence-based
TDF Compiler: compiles to working C++ simulation code; no partitioning (page = 1 TDF operator)
Applications: Wavelet, JPEG, MPEG, IIR
[Plot: runtime vs. device size]
Communication is King
With virtualization, inter-page delay is unknown and sensitive to: placement; interconnect implementation; page schedule; technology (wire delay is growing)
Inter-page feedback is SLOW: partition to contain feedback loops within a page; schedule to contain feedback loops on the device
Structural Partitioning is Not Enough
Structural partitioning does not address feedback loops:
Wire min-cut: FM, flow-based
Minimum wire length: spectral
Delay-optimal DAG mapping: DAGON, FlowMap, Wong
Structural partitioning does not address communication rates, dynamics: all loops are NOT created equal
FSM Decomposition is Not Enough
Ashar+Devadas+Newton (ICCAD ‘89): minimize logic
Kuo+Liu+Cheng (ISCAS ‘95): minimize wires
Benini+DeMicheli+Vermeulen (ISCAS ‘98): minimize power
None consider inter-page delay; none consider cutting / scheduling the data-path separately from the FSM
Outline
Motivation
Compilation Methodology
Evaluation + Architectural Studies
Timeline
Compilation – Scope
Synthesis + partitioning of SFSMs: TDF → pages (resource binding)
Target: parameterized hardware model / simulation
Constrained optimization problem; constraints: page area, IO, timing
Optimality criteria: primary: communication delay; secondary: communication bandwidth, area
Compilation Flow Overview
(1) Optimizations; (2) data-path timing + scheduling; (3) partitioning
Ignored: place / route / retime within a page (known solutions in the community)
Page scheduling: the responsibility of a separate scheduler
Synthesis + Partitioning Flow
Pipeline Extraction
Data Path Mapping
Partition Large States
Schedule DF into States
Cluster States
Page Packing
Synthesize Page FSMs
Compiler Optimizations
(Phases: optimization, preliminary code, data-path, partitioning)
How Big is an Operator?
[Chart: area (4-LUTs) for 47 operators before pipeline extraction, sorted by area, split into FSM area and DF area; applications: Wavelet Encode/Decode, JPEG Encode/Decode, MPEG Encode (I/P), IIR]
Partitioning Tasks
(1) Decompose / shrink SFSMs
(2) Pack SFSMs onto pages
[Flow diagram repeated from “Synthesis + Partitioning Flow”]
Pipeline Extraction
Hoist uncontrolled feed-forward data-flow out of the FSMD
Benefits: shrinks the FSM’s cyclic core; the extracted pipeline has more freedom for scheduling and partitioning
Example: the comparison x==0 moves into an extracted pipeline, so
  state foo(x): if (x==0) ...
becomes
  state foo(xz): if (xz) ...
where the new input xz carries the pre-computed result of x==0
Pipeline Extraction – Extractable Area
[Chart: extractable data-path area (4-LUTs) for 47 operators, sorted by data-path area, split into extracted DF area and residual DF area; applications: JPEG Encode/Decode, MPEG (I/P), Wavelet Encode, IIR]
Pipeline Extraction – Residual SFSM
[Chart: area (4-LUTs) for 47 operators after pipeline extraction, sorted by area, split into FSM area and residual DF area; applications: JPEG Encode/Decode, MPEG (I/P), Wavelet Encode, IIR]
Data-path Mapping / Scheduling
Tasks: bind technology-specific area/time to data-path primitives; schedule data-path primitives in the state machine
Fixed-frequency target: decompose primitives into multi-cycle operations; data-path module library / tree matching
Pipeline linearized sequences / loops; DAG mapping of state logic is insufficient
Compiler technology: code motion; software pipelining
Delay-Oriented State Clustering
Indivisible unit: state (CF + DF); spatial locality in state logic
Cluster states into page-size sub-machines; inter-page communication for data flow, state flow
Sequential delay is in inter-page state transfer: cluster to maintain local control; cluster to contain state loops
Similar to: VLIW trace scheduling [Fisher ‘81]; FSM decomposition for low power [Benini+DeMicheli, ISCAS ‘98]; VM/cache code placement; GarpCC HW/SW partitioning [Callahan ‘00]
State Clustering Formulation
Min-cut transition probabilities in the state flow graph (probabilities from profiling)
Area-constrained: balanced min-cut partitioning [Yang+Wong, ACM ‘94]; iterate to the desired partition area: (1−ε)A ≤ a(X) ≤ (1+ε)A
IO-constrained: add wire edges; mix edge weights: w = c·w_wire + (1−c)·w_SF; use the smallest IO-feasible c
Requires all states to be smaller than a page
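The clustering objective can be illustrated with a toy brute-force version: balanced min-cut over a tiny state-flow graph with mixed edge weights w = c·w_wire + (1−c)·w_SF. Real partitioners (FM, flow-based) replace the brute force; all names and numbers here are invented for illustration.

```python
from itertools import combinations

def min_cut(nodes, edges, area, A, eps, c):
    """Brute-force balanced min-cut with mixed edge weights."""
    best = None
    for k in range(1, len(nodes)):
        for part in combinations(nodes, k):
            X = set(part)
            aX = sum(area[v] for v in X)
            if not ((1 - eps) * A <= aX <= (1 + eps) * A):
                continue                     # area-balance constraint
            cut = sum(c * ww + (1 - c) * ws  # mixed edge weight
                      for (u, v, ww, ws) in edges
                      if (u in X) != (v in X))
            if best is None or cut < best[0]:
                best = (cut, X)
    return best

nodes = ['p1', 'p2', 'p3', 'p4']
area = {'p1': 2, 'p2': 2, 'p3': 2, 'p4': 2}
# edges: (u, v, wire weight, state-flow transition probability)
edges = [('p1', 'p2', 1, 0.9), ('p2', 'p3', 1, 0.1),
         ('p3', 'p4', 1, 0.8), ('p4', 'p1', 1, 0.2)]
cut, X = min_cut(nodes, edges, area, A=4, eps=0.0, c=0.0)
# With c=0 only state-flow probabilities count: the cheapest balanced
# cut severs the rare transitions p2-p3 and p4-p1.
```

Sweeping c upward trades state-flow locality for fewer cut wires; the slide's recipe is to pick the smallest c whose partition meets the page IO limit.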
Page Packing
Cluster SFSMs + pipelines; avoid page fragmentation
Min-cut streams of the top-level DFG; allow cutting pipelines, not SFSMs; area- and IO-constrained (Wong balanced min-cut partitioning)
Disallow certain topologies: no dynamic-rate streams within a page; data-flow feedback?
Outline
Motivation
Compilation Methodology
Evaluation + Architectural Studies
Timeline
Evaluating Paging Overhead
Applications must be rewritten in TDF. Existing: Wavelet, JPEG, MPEG, IIR. To do: ADPCM, BABAR particle detector
Metrics: circuit area (# pages × page size); page delay (LUT depth per firing); performance (total run time, “makespan”)
Baseline comparison: “unpartitioned” (page = 1 TDF operator), i.e. ideal virtualization with zero partitioning cost; cannot do better
Page Size Studies
Paging overhead varies with: application; page size, IO; the match thereof
Is paging overhead robust to a mismatch? Vary page parameters and measure:
(1) pure area overhead
(2) pure performance overhead: execute spatially in expanded hardware
(3) virtualized performance overhead: execute in a fixed device size
Outline
Motivation
Compilation Methodology
Evaluation + Architectural Studies
Timeline
Status
SCORE compiler / simulator / scheduler: compile + execute unpartitioned (page = 1 TDF operator)
Preliminary synthesis + partitioning work: pipeline extraction; FSM synthesis to SIS; area-constrained state clustering
To do: complete the initial implementation; evaluate; improve (secondary implementation)
To Complete Initial Implementation
IO-constrained state clustering
Decompose large states
Page packing
Data-path scheduling in states
Synthesize partitioned SFSMs
Secondary Implementation – Possibilities
Optimizations: SW pipelining; use SUIF
State clustering with replication
Unified state clustering + page packing: cluster states of all operators simultaneously
Finer-grained clustering: recast as BDF, min-cut stream rates
Time Line
Months 3–12 of 2001 through months 1–8 of 2002:
Impl. 1 → Eval → Impl. 2 → Eval → Thesis writing
Summary
Partitioning and paging enable: software survival / scaling; efficient use of small HW for dynamic apps
My contributions:
Methodology for page synthesis + partitioning: necessary for efficient virtualization
Evaluation framework: verify that paging can be efficient
Architectural studies
Supplemental Material
SFSMs + transforms
SCORE simulation + scaling results
Page hardware model
Synthesis observations
Architectural studies
TDF Dataflow Process Network
Dataflow Process Network [Parks+Lee, IEEE May ‘95]: a process is enabled by a set of firing rules R = {R1, R2, …, RN}
Each firing rule is a set of patterns: Ri = {Ri,1, Ri,2, …, Ri,p}
DF process for a TDF operator: feedback arc for state; one firing rule per state
Patterns match the state value + presence of the desired inputs, e.g. for state i: Ri = {Ri,1, Ri,2, …, [i]}
Patterns: Ri,j = [*] if input j is in state i’s input signature; Ri,j = ⊥ if input j is not in state i’s input signature; Ri,p = [i] for the final input, representing the state arc
These are sequential firing rules; a partitioned SFSM adds a “wait” state
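The firing-rule check can be sketched concretely: one rule per state, where a rule's signature lists the inputs that must have a token present (the [*] patterns; inputs outside the signature are ignored). `enabled` and the rule table are illustrative, not SCORE code.

```python
def enabled(rules, state, inputs):
    """True if the firing rule for `state` matches: every input in the
    state's signature has at least one token queued."""
    return all(len(inputs[name]) > 0 for name in rules[state])

# select's firing rules: state S needs s; states T, F need t, f
rules = {'S': {'s'}, 'T': {'t'}, 'F': {'f'}}
inputs = {'s': [True], 't': [], 'f': [7]}   # token queues per input

can_fire_S = enabled(rules, 'S', inputs)    # s present: S can fire
can_fire_T = enabled(rules, 'T', inputs)    # t absent:  T is blocked
```

Because the current state selects exactly one rule, at most one rule can match at a time, which is why these rules are sequential and the process remains deterministic.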
SFSM Partitioning Transform
Only 1 partition is active at a time; transform to activate via streams
New state in each partition: “wait”: used when not active; waits for activation from other partition(s); has one input signature (firing rule) per activator
Firing rules are not sequential, but determinism is guaranteed: only 1 possible activator
Activation streams from a given source to a given destination partition can be merged + binary-encoded
Distributing/Collecting Shared Streams
Requires inter-page synchronization for ordering
Two schemes for input distribution:
(1) send the token to all pages: inactive pages must discard tokens, and must know how many to discard
(2) send the token only to the active page: the distributor must know the state; (a) the present state requests the token, OR (b) the previous state pre-fetches the token
One scheme for output collection: the collector must know the state
How to cluster distributors / collectors? Distributor scheme (1) and the collector incur no sequential delay (wire min-cut ok). Distributor scheme (2)(a) can be cast into delay-optimal state clustering:
– Decompose reading states into sequences of single-read states
– Pre-cluster states that read the same stream: this forms distributors
– The sequential delay of a read request is now modeled as a state transfer to the distributor
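Input-distribution scheme (2) can be sketched as a distributor that tracks the FSM state and forwards each shared-stream token only to the page owning the current state; `distribute` and the page names are hypothetical illustrations.

```python
def distribute(tokens, state_seq, owner_of_state):
    """Forward tokens[i] to the page that owns the state active when
    token i is read (scheme 2: distributor must know the state)."""
    delivered = {}
    for tok, st in zip(tokens, state_seq):
        page = owner_of_state[st]          # distributor tracks FSM state
        delivered.setdefault(page, []).append(tok)
    return delivered

# States A, B live on page0; C, D on page1 (illustrative partition)
owner = {'A': 'page0', 'B': 'page0', 'C': 'page1', 'D': 'page1'}
out = distribute([1, 2, 3, 4], ['A', 'C', 'C', 'B'], owner)
# page0 receives the tokens read in states A and B; page1 those in C
```

Contrast with scheme (1), where every page would see all four tokens and inactive pages would have to count and discard the ones not meant for them.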
Decomposing Large States
A state may be larger than a page
Decomposing into a sequence of page-size states leads to excessive inter-page transfer
Better: delay-optimal DAG-mapping into parallel pages
SFSM Optimizations
Many traditional compiler optimization techniques apply to TDF: state flow ~ basic block flow, but with a different cost model (“unlimited” registers and functional units)
E.g. work-reducing optimizations: constant folding / propagation; common subexpression elimination; hoisting loop invariants; strength reduction
SCORE Functional Simulation
FPGA based on HSRA [Berkeley, FPGA ’99]: CP: 512 4-LUTs; CMB: 2 Mbit DRAM
Area for a CP-CMB pair: 0.25µm: 12.9 mm² (1/9 of a PII-450); 0.18µm: 6.7 mm² (1/16 of a PIII-600)
Page reconfiguration: 5000 cycles (from CMB); synchronous operation (same clock speed as the processor)
x86 microprocessor
Page Scheduler task: swap on timer interrupt (every 250,000 cycles); fully dynamic scheduling
Application: JPEG Encode
Scaling Results: JPEG Encode
[Plot: total time (makespan, in millions of cycles) vs. number of physical compute pages]
Page Hardware Model
Page = fixed-size slice of resources + stream interface
FSM for: firing; output emission; data-path control; branching
[Diagram: reconfigurable logic vs. fixed logic within a page]
Page Firing Logic
Sample firing logic: 3 inputs (A, B, C); 3 outputs (X, Y, Z); single signature
How Large is a State?
[Histogram: data-path area per state (4-LUTs) vs. count, for 1404 states from 5 applications: JPEG Encode/Decode, MPEG (I/P), IIR]
SFSM Firing Delay
A complex SFSM may require ≥1 cycle just for control: evaluate the firing rule, generate control signals, compute the next state
Should we partition the SFSM to minimize FSM logic? No: incurring inter-page communication latency is worse!
[Histogram: FSM delay (4-LUT depth) vs. count, for 47 unpartitioned operators; applications: JPEG Encode/Decode, MPEG (I/P), Wavelet Encode, IIR]
[Histogram: number of FSM inputs vs. count, for the same 47 unpartitioned operators]
Scaling the Hardware Resources
A simplified scaling model for architectural studies
Scaling page size (LUTs) induces scaling of other resources, e.g.:
Scaling memory: constant CP-to-CMB ratio
Scaling page IO: Rent’s Rule: IO = C·A^p (0 ≤ p ≤ 1)
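Rent's Rule makes the IO scaling concrete: doubling page area multiplies IO by 2^p, not 2. The constants C and p below are arbitrary example values, not SCORE parameters.

```python
def page_io(area_luts, C=3.0, p=0.5):
    """Rent's Rule estimate of page IO: IO = C * A**p, with 0 <= p <= 1."""
    return C * area_luts ** p

# Doubling page area from 512 to 1024 LUTs scales IO by 2**p (~1.41
# for p = 0.5), i.e. sub-linearly whenever p < 1.
ratio = page_io(1024) / page_io(512)
```

This sub-linear growth is why larger pages need proportionally fewer IO ports per LUT, one of the trade-offs the page-size studies above sweep.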