xpilot a platform-based synthesis systemcadlab.cs.ucla.edu/soc/docs/fan-techcon2005.pdfsystem...
xPilot: A Platform-Based Synthesis System
Project Director: Prof. Jason Cong
Email: cong@cs.ucla.edu
Students: Deming Chen, Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang
October 2005
Supported by SRC, NSF, GSRC, Altera, Xilinx.

Outline
  Motivation
  xPilot system framework
  Experimental results
  Conclusions

Motivation (1)
Design complexity is outgrowing the traditional RTL method
  It is feasible to build SoC devices with 500M transistors; billion-transistor chips are on the horizon
  Behavioral synthesis: a critical technology for enabling the move to a higher level of abstraction
  Reasons for previous failures
    • Lack of a compelling reason: design complexity was still manageable a decade ago
    • Lack of a solid RTL foundation
    • Lack of consideration of physical reality

Motivation (2)
Behavioral synthesis provides combined advantages
  Shorter verification/simulation cycle
  Better complexity management, faster time to market
  Rapid system exploration
    • Quick evaluation of different hardware/software boundaries
    • Fast exploration of multiple micro-architecture alternatives
  Higher quality of results
    • Platform-based synthesis & optimization
    • Full consideration of physical reality

Advantages: Better Complexity Management
Shorter verification/simulation cycle
  Simulation speed 100X faster than the RTL-based method [NEC, ASPDAC04]
Significant code size reduction
  RTL design ~300KL vs. behavioral design 40KL [NEC, ASPDAC04]
  VHDL code generated by UCLA xPilot targeting the Altera Stratix platform
  Over 10x code size reduction can be achieved

Advantages: Rapid System Exploration (1)
Quick evaluation of various amounts of process-level concurrency and different hardware/software boundaries
[Figure: four candidate system models, Model #1 to Model #4. Source: UCB Metropolis group]

Advantages: Rapid System Exploration (2)
Fast exploration of multiple micro-architecture alternatives
  Different hardware implementations can be easily obtained by varying the high-level spec. and applying different design constraints

Target cycle time  State#  Fmax (MHz)  Latency (µs)  Cycle#  DSP#  LE#
5.5ns              51      183.62      37.8          6926    128   1926
7ns                36      147.28      35.4          5211    128   1862
9ns                34      123.56      39.1          4830    128   1777

Platform: Altera Stratix
RTL synthesis & place-and-route: Altera Quartus II v5.0
Simulation: Mentor ModelSim SE 6.0
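
The latency column of this table can be cross-checked from the other two performance columns. The sketch below assumes latency is simply Cycle# divided by the achieved Fmax; a cycle count divided by a frequency in MHz gives microseconds. The names `design_points` and `latency_us` are illustrative, not part of xPilot. The results agree with the slide's 37.8, 35.4, and 39.1 to within rounding.

```python
# Rows from the slide: (target cycle time in ns, Cycle#, Fmax in MHz).
design_points = [
    (5.5, 6926, 183.62),
    (7.0, 5211, 147.28),
    (9.0, 4830, 123.56),
]

def latency_us(cycles, fmax_mhz):
    """Execution time in microseconds: cycles / (cycles per microsecond)."""
    return cycles / fmax_mhz

latencies = [latency_us(cycles, fmax) for _, cycles, fmax in design_points]

for (target_ns, _, _), lat in zip(design_points, latencies):
    print(f"target cycle time {target_ns} ns -> latency {lat:.2f} us")

# The design point with the shortest end-to-end latency.
best_target_ns = min(zip(latencies, design_points))[1][0]
print("lowest latency at target cycle time", best_target_ns, "ns")
```

Note that the fastest clock does not give the lowest latency here: the 7ns target needs fewer cycles than the 5.5ns target, which more than offsets its lower Fmax.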

Advantages: Higher Quality of Results (1)
Platform-based synthesis & optimization
  The quality of an RTL design is platform-dependent
  Designers often lack complete and detailed knowledge of the target platform

Resource          Area            Delay (ns)
ADDSUB-24b        25 LUTs         2.27
ADDSUB-32b        33 LUTs         2.61
MUX8to1-24b       120 LUTs        2.92
MUX16to1-24b      264 LUTs        4.658
DSPMUL-18bx18b    2 DSP Blocks    3.833
DSPMUL-24bx24b    8 DSP Blocks    7.688

Platform: Altera Stratix
RTL synthesis & place-and-route: Altera Quartus II v5.0

3X3 Delay Matrix (corner coordinates (0,0) and (95,61)):
  4.7  3.8  2.8
  3.7  2.9  2.0
  2.8  1.8  0.58

Motivation: Higher Quality of Results (2)
Communication-centric synthesis & optimization with full consideration of physical reality
  System performance & power are dominated by interconnect
  It is difficult for designers to consider physical layout at the RT level
Layout-aware performance optimization
  Overlap computation with communication
  [Figure: data transfer among units add1, mul1, add2, mul2]
Layout-aware power optimization
  [Figure: STG with a true (T) and a false (F) branch over operations 2*, 3*, 4*, 5*, 6*; candidate bindings mul1(2,5,6), mul2(3,4) and mul1(2,4,5), mul2(3,6)]
  Binding solution 1: both multipliers stay active
  Binding solution 2: mul2 can be powered off when the true branch is taken

Outline
  Motivation
  xPilot system framework
    Overview
    Platform specification
    System synthesis data model
    Scheduling
    Resource binding
  Experimental results
  Conclusions

xPilot: Platform-Based Synthesis System
[Flow diagram]
  Inputs: SystemC/C behavior spec.; platform description & constraints
  Compilation front end
  SSDM (System-Level Synthesis Data Model)
  xPilot synthesis system: scheduling (SSDM/CDFG → SSDM/STG), binding, RTL generation
  Output: RTL VHDL and design constraints

System-Level Synthesis Data Model
SSDM (System-Level Synthesis Data Model)
  Hierarchical netlist of concurrent processes and communication channels
  Each leaf process contains a sequential program, represented by an extended LLVM IR with hardware-specific semantics
    • Port / IO interfaces, bit-vector manipulations, cycle-level notations

Platform Modeling & Characterization
Target platform specification
  High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
    • Functional units: adders, ALUs, multipliers, comparators, etc.
    • Connectors: mux, demux, etc.
    • Memories: registers, synchronous memories, etc.
  Chip layout description
    • On-chip resource distributions
    • On-chip interconnect delay/power estimation

3X3 Delay Matrix for Stratix-EP1S40 (corner coordinates (0,0) and (95,61)):
  4.7  3.8  2.8
  3.7  2.9  2.0
  2.8  1.8  0.58

Synthesis Engine Overview
Scheduling: assignment of the operations to control states
Binding: assignment of the operations and variables to functional units and registers, respectively
Flow: CDFG → Scheduling → STG → Binding → RTL Model
[Figure: example CDFG with operations *1, +2, +3, +4, *5; after binding, MUL (1, 5), ALU (2, 3), and ALU (4), with a MUX]

Scheduling: Problem Statement
Scheduling problem in behavioral synthesis
  Given:
    • A control data flow graph (CDFG)
    • A set of scheduling constraints: resource constraints, latency constraints, frequency constraints, relative IO timing constraints, etc.
  Goal:
    • Assign the operations to control states so that a particular design objective (performance / power) is optimized while all the constraints are satisfied.
[Figure: example CDFG with operations *1, +2, +3, +4, *5 scheduled into control states CS0 and CS1]

Scheduling: Overall Approach
Overall approach
  Current objective: high performance
  Use a system of pairwise difference constraints to express all kinds of scheduling constraints
  Represent the design objective as a linear function

Example CDFG: v1 (+), v2 (*), v3 (*), v4 (−), v5 (+)
Platform characterization:
  • adder (+/−): 2ns
  • multiplier (*): 5ns
Target cycle time: 10ns
Resource constraint: only ONE multiplier is available

Dependency constraints
  • v1 → v3: x3 − x1 ≥ 0
  • v2 → v3: x3 − x2 ≥ 0
  • v3 → v4: x4 − x3 ≥ 0
  • v4 → v5: x5 − x4 ≥ 0
Frequency constraint
  • <v2, v5>: x5 − x2 ≥ 1
Resource constraint
  • <v2, v3>: x3 − x2 ≥ 1

In matrix form, A x ≤ b. The dependency v2 → v3 is subsumed by the resource constraint on the same pair, so the second row has b = −1:

\[
\begin{bmatrix}
1 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1 \\
0 & 1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\le
\begin{bmatrix} 0 \\ -1 \\ 0 \\ 0 \\ -1 \end{bmatrix}
\]

A is a totally unimodular matrix, which guarantees integral solutions
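
The five difference constraints of this example can be solved by hand or with a few lines of code. The sketch below is not the xPilot scheduler, which the slides describe as using a linear programming solver. It swaps in a Bellman-Ford style longest-path relaxation, a standard way to find the earliest feasible solution of a system of difference constraints. The function name `asap_schedule` and the 0-based indexing of v1..v5 are assumptions made for illustration.

```python
def asap_schedule(num_ops, constraints):
    """Earliest control state per operation.

    constraints: list of (i, j, w), each meaning x[j] - x[i] >= w.
    Raises ValueError when the constraints cannot all be satisfied.
    """
    x = [0] * num_ops  # every operation may start in control state 0
    for _ in range(num_ops):
        changed = False
        for i, j, w in constraints:
            if x[i] + w > x[j]:  # constraint violated: push j later
                x[j] = x[i] + w
                changed = True
        if not changed:
            return x
    raise ValueError("infeasible: the constraints form a positive cycle")

# v1..v5 are indexed 0..4; one tuple per row of the matrix form.
constraints = [
    (0, 2, 0),  # dependency v1 -> v3
    (1, 2, 1),  # resource <v2, v3>: one multiplier (covers dependency v2 -> v3)
    (2, 3, 0),  # dependency v3 -> v4
    (3, 4, 0),  # dependency v4 -> v5
    (1, 4, 1),  # frequency <v2, v5>: the chain exceeds the 10ns cycle
]

schedule = asap_schedule(5, constraints)
print(schedule)
```

Under these constraints v1 and v2 land in control state 0 and v3, v4, v5 in control state 1. The chained delay in state 1 is 5 + 2 + 2 = 9ns, which fits the 10ns target cycle time.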

Scheduling: Design Framework
xPilot scheduler
  Inputs: CDFG; platform spec. & user-specified constraints
  Constraint equations generation and objective function generation
  System of pairwise difference constraints
    Relative timing constraints
    Dependency constraints
    Frequency constraints
    Resource constraints
    …
  Linear programming solver
  Output: STG (State Transition Graph)
Highlights
  Applicable to a wide range of application domains
    • Computation-intensive, memory-intensive, control-intensive, partially timed, etc.
  Offers a variety of optimization techniques in a unified framework
    • Operation chaining, behavioral template, relative scheduling, physical layout consideration, etc.

Resource Binding: Problem Statement
Resource binding problem
  Given: (1) a scheduled control data flow graph, i.e., an STG; (2) design constraints: performance, delay, power, etc.
  Goal: assign the operations and variables to functional units and registers, respectively, so that their executions or lifetimes do not conflict and all of the design constraints are satisfied.
Properties of the problem
  FU and register binding are highly correlated
  Simultaneous FU and register binding considering interconnection is very difficult
Two binding solutions: which one is better?
  [Figure: two ways of binding operations +1 and +2 to ALUs, with and without a MUX]
  The answer depends on:
    1. How large the MUX and the ALU are (platform-dependent)
    2. Performance and area constraints

The Exploration Engine
Branch and pruning search
Node 1: only one solution, (1)
Exploring Node 2:
  Generated solutions:
    • (1) (2); (1, 2)
  Pruned: (none)
Exploring Node 3:
  Generated solutions (note: 2 and 3 are incompatible):
    • (1) (2) (3); (1, 2) (3); (1, 3) (2)
  Pruned: (1, 2) (3)
Exploring Node 4:
  Generated solutions:
    • (1) (2) (3) (4); (1) (2) (3, 4); (1) (2, 4) (3); (1, 4) (2) (3)
    • (1, 3) (2) (4); (1, 3) (2, 4); (1, 3, 4) (2)
  Pruned: (1) (2, 4) (3); (1, 3) (2, 4)

[Figure: design space (area vs. delay) showing the solutions generated at each node and the pruned ones]
[Figure: a state transition graph (STG) with conditions C1, C1', C2, C2' and operations 1*, 2*, 3*, 4*, 5*, 6+; its compatible graphs; and the corresponding datapath model with MUL units]
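
The generation step of this walkthrough can be reproduced in a few lines. The sketch below is an illustration under assumptions, not the xPilot binding engine: it extends each surviving partial binding with the next node, either on a new unit or on an existing unit whose members are all compatible with it. The slides do not give the cost evaluation behind the pruning, so the pruning decisions are replayed verbatim from the slide through the hypothetical table `pruned_on_slide` rather than computed.

```python
def expand(solutions, node, incompatible):
    """Extend every partial binding with `node`.

    A binding is a tuple of groups; each group is a tuple of
    operations that share one functional unit.
    """
    out = []
    for sol in solutions:
        # Option 1: allocate a new unit for the node.
        out.append(sol + ((node,),))
        # Option 2: share an existing unit if no member conflicts with it.
        for k, group in enumerate(sol):
            if all(frozenset((node, m)) not in incompatible for m in group):
                out.append(sol[:k] + (group + (node,),) + sol[k + 1:])
    return out

incompatible = {frozenset((2, 3))}  # from the slide: 2 and 3 are incompatible

# Pruning decisions copied from the slide (the cost model is not given).
pruned_on_slide = {
    3: {((1, 2), (3,))},
    4: {((1,), (2, 4), (3,)), ((1, 3), (2, 4))},
}

solutions = [((1,),)]  # Node 1: only one solution
generated_at = {}
for node in (2, 3, 4):
    generated = expand(solutions, node, incompatible)
    generated_at[node] = generated
    solutions = [s for s in generated
                 if s not in pruned_on_slide.get(node, set())]
    print(f"node {node}: generated {len(generated)}, kept {len(solutions)}")
```

Pruning after each node is what keeps the search tractable: the seven candidates at node 4 come from only the two survivors of node 3, instead of from all three generated there.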

Our Approach: Unified Resource Binding
An efficient solution space exploration framework
Simultaneous functional unit and register binding
Emphasis on the interconnect and steering logic networks
Guided by a flexible platform-based cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
Extendable to exploit physical layout information
xPilot binding engine
  Inputs: STG (State Transition Graph); platform info & user-specified constraints
  I/O port and constant binding
  Branch & pruning exploration engine
  Datapath model for estimation
  Static timing analysis & area, power estimator
  Output: STG + best datapath models

Outline
  Motivation
  xPilot system framework
  Experimental results
  Conclusions

Experimental Results: Scheduling Results

Benchmark        SPARK             xPilot            Latency
                 #States  LP       #States  LP       improvement
ADPCM-encoder    16       133      14       122      8.3%
ADPCM-decoder    15       327      12       251      23.2%
GIMP-tiler       27       2234     42       1977     11.5%
MPEG2-dpframe    32       424      53       375      11.6%
Average                                              13.6%

SPARK [UCI/UCSD, 2004] is a state-of-the-art academic high-level synthesis tool
xPilot can achieve a 13.6% improvement in latency over SPARK
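
The improvement figures in this table are consistent with a relative reduction of the LP value, (LP_SPARK − LP_xPilot) / LP_SPARK, averaged over the four benchmarks. The sketch below recomputes them from the LP values on the slide; the formula is inferred from the numbers, and the variable names are illustrative.

```python
# Benchmark -> (SPARK LP, xPilot LP), copied from the slide.
lp = {
    "ADPCM-encoder": (133, 122),
    "ADPCM-decoder": (327, 251),
    "GIMP-tiler": (2234, 1977),
    "MPEG2-dpframe": (424, 375),
}

# Relative latency reduction of xPilot over SPARK, in percent.
improvement = {
    name: (spark - xpilot) / spark * 100.0
    for name, (spark, xpilot) in lp.items()
}
average = sum(improvement.values()) / len(improvement)

for name, pct in improvement.items():
    print(f"{name}: {pct:.1f}%")
print(f"Average: {average:.1f}%")
```

The improvement holds even where xPilot uses more states (GIMP-tiler, MPEG2-dpframe), so the state count alone does not determine latency.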

Experimental Results: Comparison with SPARK on Scheduling

Designs  Tool/Flow    Synthesis report   Altera Quartus II report
                      state#   reg#      fmax (MHz)  LE      register  mem    dsp
MOTION   spark        13       18        170.8       666     367       0      4
         xpilot       24       11        161.2       888     266       0      4
PR       spark        13       36        130.6       508     491       0      32
         xpilot       13       40        178.7       1,349   783       0      0
IDCT     spark        176      ~400      72.01       10,847  4,547     0      138
         xpilot       141      413       105.53      11,481  5,627     0      64
         xpilot-mem   334      451       162.9       9,351   6,098     1,024  64
CACHE    spark        Memory unsupported
         xpilot-mem   47       16        161.6       371     265       3072   0

SPARK [UCI/UCSD, 2004] is a state-of-the-art academic high-level synthesis tool
Setting: Altera's Quartus II 4.2; target FPGA device: Stratix; 200MHz

Experimental Results: Comparison with SPARK on Binding
Setting: Altera's Quartus II 4.2; target FPGA device: Stratix; 200MHz
On average, xPilot resource binding achieves designs with similar area and 2.48x higher frequency than SPARK

SPARK
Designs    LE      COMB    Lonely-Reg  Comb-Reg  DSP   Fmax (MHz)
PR         1108    815     0           293       0     123.53
WANG       1217    942     0           275       0     118.89
LEE        1367    1052    0           315       0     119.32
MCM        2808    2248    0           560       0     74.87
DIR        2425    2034    0           391       6     69.38
FEIG       16170   13136   0           3034      6     37.17
Total      25095   20227   0           4868      12    543.16
Ave Ratio  1       1       1           1         1     1

xPilot
Designs    LE      COMB    Lonely-Reg  Comb-Reg  DSP   Fmax (MHz)  Fmax ratio xPilot/SPARK
PR         1349    713     84          552       0     178.7       1.45
WANG       1105    527     62          516       8     166.11      1.40
LEE        1585    691     207         687       4     166.61      1.40
MCM        2402    981     73          1348      0     152.56      2.04
DIR        3489    1752    110         1627      4     146.8       2.12
FEIG       10539   2295    240         8004      4     173.49      4.67
Total      20469   6959    776         12734     20    984.27      1.81
Ave Ratio  1.17    0.65    n/a*        2.96      n/a*  2.48        2.48

Experimental Results: ASIC Flow (TSMC90nm) (1)
Magma RTL to GDSII flow
Technology library: TSMC 90nm
Tradeoff study:
  1st column: delay constraint enforced in xPilot and the Magma tool
  2nd column: control step count of the xPilot-generated RTL
  3rd-5th columns: data reported after mapping by the Magma tool
Designs: PR, DIR

Clock period     Latency    Cell    Area     Path delay  Fmax    Total latency
constraint (ps)  (cycle#)   count   (µm²)    (ps)        (MHz)   (ps)
1000             6          9494    49492    1235        809.71  7410
2000             3          4029    24639    1930        518.13  5790
3000             2          2673    19653    2931        341.18  5862
4000             2          1447    17758    3844        260.14  7688

Clock period     Latency    Cell    Area     Path delay  Fmax    Total latency
constraint (ps)  (cycle#)   count   (µm²)    (ps)        (MHz)   (ps)
1000             8          35889   181045   1029        971.81  8232
2000             3          17747   100788   1829        546.74  5487
3000             2          10324   72626    3063        326.47  6126
4000             2          3705    55310    3841        260.34  7682

Plot on the next slide …
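
The last two columns of these tables follow from the others: Fmax is the reciprocal of the reported path delay, and total latency is the cycle count times the path delay. The sketch below recomputes both for the first table and picks the constraint with the lowest total latency. The helper names are illustrative and the formulas are inferred from the numbers on the slide.

```python
# First table: (clock period constraint in ps, latency in cycles, path delay in ps).
rows = [
    (1000, 6, 1235),
    (2000, 3, 1930),
    (3000, 2, 2931),
    (4000, 2, 3844),
]

def fmax_mhz(path_delay_ps):
    """1 / (delay in ps) expressed in MHz: 1e6 / delay_ps."""
    return 1e6 / path_delay_ps

def total_latency_ps(cycles, path_delay_ps):
    """End-to-end latency when the clock runs at the achieved path delay."""
    return cycles * path_delay_ps

totals = [total_latency_ps(cycles, delay) for _, cycles, delay in rows]
best_constraint = min(rows, key=lambda r: total_latency_ps(r[1], r[2]))[0]

for (constraint, cycles, delay), total in zip(rows, totals):
    print(f"{constraint} ps: Fmax {fmax_mhz(delay):.2f} MHz, total {total} ps")
print("lowest total latency at constraint", best_constraint, "ps")
```

For this table the tightest constraint is not the fastest overall: at 1000ps the cycle count doubles relative to 2000ps while the achieved path delay improves by far less than 2x.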

Experimental Results: ASIC Flow (TSMC90nm) (2)
Area vs. clock-period tradeoffs during behavioral synthesis
[Plot: clock period (ps), 0 to 4500, against area (µm²), 0 to 200000, for designs PR and DIR]
The xPilot-Magma flow shows clear area vs. clock period tradeoffs
Behavioral synthesis examines a larger tradeoff range than RTL synthesis
ConclusionsConclusionsxPilotxPilot can automatically synthesize behavior level C or can automatically synthesize behavior level C or SystemCSystemC presentation to RTL code with necessary design presentation to RTL code with necessary design constraintsconstraints
PlatformPlatform--based synthesis based synthesis LayoutLayout--driven synthesisdriven synthesisShows encouraging experimental resultsShows encouraging experimental results
xPilotxPilot is an ongoing projectis an ongoing projectEnhanced optimization for further improvement of Enhanced optimization for further improvement of QoRQoRCombining layoutCombining layout--driven synthesis driven synthesis Power related optimizationPower related optimization

Acknowledgements
We would like to thank NSF, GSRC, SRC, and the industrial sponsors under the California MICRO program (Altera, Xilinx) for their support
Group members:
  Prof. Jason Cong
  Prof. Deming Chen
  Yiping Fan
  Guoling Han
  Wei Jiang
  Zhiru Zhang

Thank you