Post on 18-Dec-2015
Copyright © Bluespec Inc. 2006 Confidential and Proprietary
From ESL to Implementation: Reinventing Hardware Design using Bluespec SystemVerilog™
© 2006, Bluespec, Inc.
Joe Stoy
Founder and Principal Engineer
Bluespec Inc.
14-16 Spring Street
Waltham MA 02451, USA
+1 781 250 2206
www.bluespec.com
Bluespec SystemVerilog Workshop Agenda
Intro: why an HDL can affect overall productivity, from concept to silicon
Behavior:
  Rules: a new way to express HW behavior
  Correctness: why rules help
  Comparison with behavioral synthesis
  Rule-based Interface Methods: modularizing rules
Structure: improving the expression of HW structure using ideas from advanced programming languages
Clock domains and gated clocks: compiler-guaranteed safety
Testbenches using BSV
Transaction Level Modeling/architecture exploration and refinement, within a single paradigm
Synthesis quality: as good as hand-coded RTL
Tool flows
Futures:
  Formal verification
Intro: why an improved HDL is a central need to address today’s chip design complexities
Moore’s Law: “Silicon capacity doubles every 18 to 24 months”
Source: http://www.intel.com/technology/silicon/mooreslaw/index.htm
Today (2005):
• ~10-20 M gates
• 90nm, 65nm
Today’s chips: “SoC”s (System on a Chip)
“IP” blocks (“Intellectual Property”)
  Processors
  Caches, Memories
  Interconnects
  DMAs
  Other peripheral blocks
  I/O blocks
E.g., cell phones, cell network base stations, TV set-top boxes, iPods, digital cameras, …
[ Figure: SoC block diagram. A System Bus and a Peripheral Bus joined by a Bus Bridge; blocks include Processor, Memory Controller, DMA Controller, DSP, Power Management, Arbitration, Application Specific, DRAM, SRAM, L2 Cache Ctlr, Serial Controller, Audio, Video, Flash/Mem I/F, Bus Controller ]
ASIC design flow, and costs
[ Figure: design flow stages laid out over time: Architecture, Design, Verification and Test, Physical Design ]
Can take ~12-24 months
Can cost $10 Million+ (and rising)
Bug respin cost + market window cost
Verification costs and chip quality are getting worse
66% of new ICs/ASICs require at least one re-spin
75% are due to logical/functional errors (an increase from 71% two years prior)
[ Figure: bar chart “IC Design Costs”: cost ($M, axis 0 to 30) versus silicon feature dimension (0.18µm, 0.13µm, 90nm), broken down into Prototype, Validation, Physical, Verification, Architecture ]
Source: IBM/IBS, Inc.
Source: 2004 Collett study
Design affects everything!
Myth: improving the Design language will have little impact
In fact, the Design language impacts all activities
[ Figure: design flow stages: Architecture, Design, Verification and Test, Physical Design ]
“It is a profoundly erroneous truism, repeated by all copybooks and by eminent people when they are making speeches, that we should cultivate the habit of thinking of what we are doing. The precise opposite is the case. Civilization advances by extending the number of important operations which we can perform without thinking about them. …”
Alfred North Whitehead, mathematician and philosopher (1861-1947)
[ Example: long division used to be an advanced subject in the days of Roman numerals; Arabic numerals changed that ]
How to improve productivity?
The language of design is crucial!
Software analogy:
Assembler → Fortran → C → C++ → Java
No theoretical difference (all Turing-complete)
“I can produce better code by writing it in Assembler”
  Maybe, if you are given enough time!
  “Better” = more efficient, but not more readable, maintainable, or reusable
You can still write incorrect code; you still need to debug; you still need to verify. But the probabilities of certain bugs decrease and the kinds of bugs change, as you go to higher levels:
  Register protocol, argument/result-passing protocol, stack protocol, byte/word-alignment issues
  Reentrancy and recursion protocols
  Memory layout of complex data
  Memory allocation/deallocation
  Type-misinterpretation
  Code reuse (parameterization and polymorphism)
Some lessons from SW language history
The size/complexity of the system that you can build, correctly, within a short time, improves with higher levels of abstraction
But also, crucially, people will not/cannot use your new higher level language for serious work
  if it sacrifices efficiency
  if it is unpredictable/uncontrollable
Evolution of HDLs (Hardware Description Languages)
[ Figure: timeline. Hand-drawn circuit diagrams (schematics); Schematic Capture (automated); ~1985: text-based RTL languages, Verilog & VHDL; IEEE Verilog standards (also VHDL standards): 1995, 2001, 2005; 2004: SystemVerilog (Accellera); 2005: IEEE; then “?” ]
(RTL = Register-Transfer Level)
Fully synthesizable – without compromise!
Bluespec: Better Design Accelerates Everything!
Architecture
Design
Verification and Test
Physical Design
More architectural flexibility during design
50% reduction in errors, faster correction
50% reduction from design to verified netlist
Architectural exploration
Early executable models
Better reuse
Faster fixes, to achieve closure
Bluespec, Inc. company and technology background
Bluespec, Inc. background
[ Figure: timeline with marks at ~1996, 2000, 2003 and arrows labeled “Technology” and “VC funding”:
  Research@MIT on high-level synthesis & verification
  Sandburst Corp: 10Gb/s core router ASICs (Bluespec: further technology development)
  Bluespec, Inc.: high-level design and synthesis tool (SystemVerilog-based) ]
Bluespec, Inc.
Headquartered in Waltham, MA
~45 people (MA, CA, Europe, Armenia, India)
Technology, 1997-present
  MIT research: Professor Arvind, students & colleagues
  Patented: HW synthesis from Rules
Active IEEE P1800/Accellera member; SV language contributor, SystemC language contributor
What does Bluespec offer?
A new and powerful way to explore and express designs; tools to simulate and to synthesize into quality RTL; feeding into existing RTL-to-chip tools/flows
[ Figure: tool flow. SystemVerilog (design subset) with Rules and Rule-based Interfaces; Bluespec Synthesis; Verilog 95 RTL; Verilog sim; Bluesim (Cycle Accurate); RTL synthesis, Physical design; Tapeout ]
Bluespec Solutions
Bluespec core technologies
Design – executable specifications
  Synthesizable, high-level concurrency semantics
  Transactional interfaces for design with self-documenting protocol
Verification – static and formal
  Strong type checking
  Interface connectivity and protocol checking
  Race condition identification and management
  Multiple-domain clock and interface checking
  Rapid simulation with C/C++ functions
Bluespec tools
[ Figure: Bluespec tool flow. Labels: BSV, SystemC [ESE], Parsing, Common Synthesis Engine (Static Checking, Scheduling, Optimization, Power Optimization, RTL Generation), RTL, TRANSLATE, gcc, libsystemc.h, .exe, Bluespec Synthesis, Bluesim, SystemC Simulation, Blueview Debug; captions “Rapid, Source-Level Simulation and Interactive Debug of BSV” and “Cycle-Accurate w/Verilog sim” ]
Creating ESL methodologies
Abstraction Level | Purposes | Design Components | Prerequisites
Bandwidth Accurate | Architectural Exploration | Transactions, Functional Model | Simulation speed, instrumentation, protocol checking
Latency Accurate | Software Test Platform | Functional Model with accurate timing and full concurrency | Simulation speed and register interfaces
Cycle Accurate | Power Optimization & Firmware Development | Defined Buses, registers, concurrency | Rapid changes in micro-architecture and automatic RTL generation
Bit Accurate | Implementation & Integration | Automatically generated with rules & formal interfaces | Easy ECOs and timing closure
Side labels: Consistent Connectivity through Formal I/F Methods; Consistent Verification and Debugging Paradigms
ESL to Implementation
Technologies: Concurrency Semantics; Formal Interfaces; Static, Formal Checking; Low Power Optimization
Tools: BSV, SystemC [ESE], RTL, TRANSLATE, gcc, libsystemc.h, .exe, Bluespec Synthesis, Bluesim, Blueview, SystemC Simulation
Methodologies: Bandwidth Accurate; Latency Accurate; Cycle Accurate; Bit Accurate
Bluespec SystemVerilog Agenda: Technical Deep Dive
Intro: why an HDL can affect overall productivity, from concept to silicon
Behavior:
  Rules: a new way to express HW behavior
  Correctness: why rules help
  Comparison with behavioral synthesis
  Rule-based Interface Methods: modularizing rules
Structure: improving the expression of HW structure using ideas from advanced programming languages
Clock domains and gated clocks: compiler-guaranteed safety
Testbenches using BSV
Transaction Level Modeling/architecture exploration and refinement, within a single paradigm
  Comparison with SystemC
Synthesis quality: as good as hand-coded RTL
Tool flows
  Coexistence with Verilog/VHDL/SV/SystemC
Futures:
  Integration of Rules and Rule-based Interfaces into SystemC
  Formal verification
Bluespec SystemVerilog™: A one-slide overview
Two dimensions raising the level of abstraction (fully synthesizable)
[ Figure: two axes, “Behavioral” and “Structural”, positioning VHDL/Verilog/SystemVerilog/SystemC and Bluespec SystemVerilog ]
Rules and Rule-based Interfaces
  For complex concurrency and control, across multiple shared resources, across module boundaries
High-level abstract types
Powerful static checking
Powerful parameterization
Powerful static elaboration
Advanced clock management
Complex concurrency with shared resources
HW by its very nature is highly concurrent
  A HW design can be viewed as a set of cooperating concurrent FSMs
  The cooperation occurs through shared resources
Today’s SoCs have enormous amounts of complicated concurrency and shared resources
How do we express this today?
  Concurrency expressed with processes (“always” blocks in RTL)
  Access to shared resources is tediously micro-managed (if-then-elses inside always blocks)
Unfortunately: this does not scale
  Leads to race conditions (inconsistent state in the shared resources) which are very tricky to discover, diagnose, fix
Simple example with concurrency and shared resources
Process 0: increments register x when cond0
Process 1: transfers a unit from register x to register y when cond1
Process 2: decrements register y when cond2
Each register can only be updated by one process on each clock. Priority: 2 > 1 > 0
Just like real applications, e.g.: Packet arrives, is processed, departs
[ Figure: processes 0, 1, 2 guarded by cond0, cond1, cond2; process 0 adds 1 to x, process 1 subtracts 1 from x and adds 1 to y, process 2 subtracts 1 from y. Process priority: 2 > 1 > 0 ]
Which one is correct?
What’s required to verify that they’re correct?
What if the priorities changed: cond1 > cond2 > cond0?
What if the processes are in different modules?
[ Figure: the three-process diagram, process priority 2 > 1 > 0 ]
always @(posedge CLK) begin
  if (!cond2 || cond1) x <= x - 1;
  else if (cond0) x <= x + 1;
  if (cond2) y <= y - 1;
  else if (cond1) y <= y + 1;
end
always @(posedge CLK) begin
  if (!cond2 && cond1) x <= x - 1;
  else if (cond0) x <= x + 1;
  if (cond2) y <= y - 1;
  else if (cond1) y <= y + 1;
end
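One way to approach the question of which always block is correct is an exhaustive check over the three condition bits. The sketch below is a Python model, not part of the original slides. It assumes one reading of the stated requirement (priority 2 > 1 > 0, each register written by at most one process per clock); the names `spec`, `variant_a`, `variant_b` and `mismatches` are hypothetical.

```python
from itertools import product

def spec(x, y, cond0, cond1, cond2):
    """Assumed reading of the spec: priority 2 > 1 > 0, one writer per register."""
    fire2 = cond2
    fire1 = cond1 and not fire2   # proc1 and proc2 both write y
    fire0 = cond0 and not fire1   # proc0 and proc1 both write x
    if fire1:
        x, y = x - 1, y + 1
    if fire0:
        x = x + 1
    if fire2:
        y = y - 1
    return x, y

def variant_a(x, y, cond0, cond1, cond2):
    """Transcription of the always block whose x guard is (!cond2 || cond1)."""
    nx, ny = x, y
    if (not cond2) or cond1:
        nx = x - 1
    elif cond0:
        nx = x + 1
    if cond2:
        ny = y - 1
    elif cond1:
        ny = y + 1
    return nx, ny

def variant_b(x, y, cond0, cond1, cond2):
    """Transcription of the always block whose x guard is (!cond2 && cond1)."""
    nx, ny = x, y
    if (not cond2) and cond1:
        nx = x - 1
    elif cond0:
        nx = x + 1
    if cond2:
        ny = y - 1
    elif cond1:
        ny = y + 1
    return nx, ny

def mismatches(variant):
    """List the (cond0, cond1, cond2) triples where variant disagrees with spec."""
    return [c for c in product([False, True], repeat=3)
            if variant(5, 5, *c) != spec(5, 5, *c)]
```

Under this reading, `mismatches(variant_b)` is empty, while `mismatches(variant_a)` is not: for example, with all three conditions false the `||` version still decrements x.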
With Bluespec, the design is direct
(* descending_urgency = "proc2, proc1, proc0" *)
rule proc0 (cond0);
  x <= x + 1;
endrule
rule proc1 (cond1);
  y <= y + 1;
  x <= x - 1;
endrule
rule proc2 (cond2);
  y <= y - 1;
endrule
Hand-written RTL: complexity due to:
  State-centric (for synthesizability)
  Scheduling clutter
BSV:
  Functional correctness follows directly from rule semantics
  Executable spec (operation-centric)
  Automatic handling of shared resource mux logic
  Same hardware as the RTL
[ Figure: the three-process diagram, process priority 2 > 1 > 0 ]
Now, let’s make a small change: add a new process and insert its priority
[ Figure: the previous diagram with a new process 3, guarded by cond3, that adds 2 to x and subtracts 2 from y. Process priority: 2 > 3 > 1 > 0 ]
Changing the Bluespec design
Process priority: 2 > 3 > 1 > 0
Pre-Change:
(* descending_urgency = "proc2, proc1, proc0" *)
rule proc0 (cond0);
  x <= x + 1;
endrule
rule proc1 (cond1);
  y <= y + 1;
  x <= x - 1;
endrule
rule proc2 (cond2);
  y <= y - 1;
endrule
Post-Change:
(* descending_urgency = "proc2, proc3, proc1, proc0" *)
rule proc0 (cond0);
  x <= x + 1;
endrule
rule proc1 (cond1);
  y <= y + 1;
  x <= x - 1;
endrule
rule proc2 (cond2);
  y <= y - 1;
  x <= x + 1;
endrule
rule proc3 (cond3);
  y <= y - 2;
  x <= x + 2;
endrule
Changing the Verilog design
Process priority: 2 > 3 > 1 > 0
Pre-Change:
always @(posedge CLK) begin
  if (!cond2 && cond1) x <= x - 1;
  else if (cond0) x <= x + 1;
  if (cond2) y <= y - 1;
  else if (cond1) y <= y + 1;
end
Post-Change:
always @(posedge CLK) begin
  if ((cond2 && cond0) || (cond0 && !cond1 && !cond3)) x <= x + 1;
  else if (cond3 && !cond2) x <= x + 2;
  else if (cond1 && !cond2) x <= x - 1;
  if (cond2) y <= y - 1;
  else if (cond3) y <= y - 2;
  else if (cond1) y <= y + 1;
end
Key Benefits
Executable specifications
Rapid changes
But, with fine-grained control of RTL:
  Define the optimal architecture/micro-architecture
  Debug at the source OR RTL level – designer understands both
  The Quality of Results (QoR) of RTL!
The concurrency complexities illustrated in the simple example are greatly magnified in real designs
A more complex example, from CPU design
Dave & Arvind, 2003
Speculative, out-of-order
Many, many concurrent activities
[ Figure: processor block diagram: Fetch, Decode, Re-Order Buffer (ROB), Branch, ALU Unit, MEM Unit, Register File, Instruction Memory, Data Memory, connected by FIFOs ]
Many concurrent actions on common state: nightmare to manage explicitly
[ Figure: Re-Order Buffer drawn as a table with Head and Tail pointers; each entry has State (E = Empty, W = Waiting), Instruction, Operand 1, Operand 2 and Result fields; a few entries hold instructions A, B, C, D in the Waiting state, the rest are Empty. Actions shown around the ROB:
  Decode Unit: put an instr into ROB
  Register File: get operands for instr; writeback results
  ALU Unit(s): get a ready ALU instr; put ALU instr results in ROB
  MEM Unit(s): get a ready MEM instr; put MEM instr results in ROB
  Resolve branches ]
But in BSV…
..you can code each operation in isolation, as a rule
..the tool guarantees that operations are INTERLOCKED (i.e. each runs to completion without external interference)
Insert Instr in ROB
• Put instruction in first available slot
• Increment tail pointer
• Get source operands - RF <or> prev instr
Dispatch Instr
• Mark instruction dispatched
• Forward to appropriate unit
Write Back Results to ROB
• Write back results to instr result
• Write back to all waiting tags
• Set to done
Commit Instr
• Write results to register file (or allow memory write for store)
• Set to Empty
• Increment head pointer
Branch Resolution
• …
• …
• …
The key:
Rules execute atomically
Reference semantics:
while some rules are enabled
  choose one enabled rule
  execute it
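The reference semantics can be sketched as a small interpreter loop. This is an illustrative Python model, not Bluespec's implementation: a rule is assumed to be a (guard, action) pair of functions on an immutable state, and `run_reference`, `transfer` and `drain` are hypothetical names that echo the x/y example.

```python
import random

def run_reference(state, rules, max_steps=1000, seed=0):
    """Fire one enabled rule at a time until none is enabled (or max_steps)."""
    rng = random.Random(seed)
    steps = 0
    while steps < max_steps:
        enabled = [(guard, action) for guard, action in rules if guard(state)]
        if not enabled:
            break
        _, action = rng.choice(enabled)   # "choose one enabled rule"
        state = action(state)             # "execute it": one atomic state change
        steps += 1
    return state, steps

# Hypothetical rules on a state (x, y):
# transfer moves one unit from x to y, drain removes one unit from y.
transfer = (lambda s: s[0] > 0, lambda s: (s[0] - 1, s[1] + 1))
drain = (lambda s: s[1] > 0, lambda s: (s[0], s[1] - 1))
```

Starting from `(3, 0)`, every choice sequence ends in `(0, 0)` after six rule firings, because each firing is a complete, uninterrupted step; only the order of the steps varies with the seed.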
Atomicity
atomic (Greek: ατομος)
a- means “not”: asymmetric, atypical, amoral
-tom- means “cut”: microtome, tomography, appendectomy, tome (of a multi-volume book)
Rules are atomic: “not cut”
Whenever they run, they run to completion, never interrupted
  No other activities are interleaved with them
This greatly simplifies design
  avoids many race conditions
  easier to prove invariants
Extensive supporting theory in computer science literature
Term Rewriting Systems, Terese, Cambridge Univ. Press, 2003, 884 pp.
Parallel Program Design: A Foundation, K. Mani Chandy and Jayadev Misra, Addison Wesley, 1988
  UNITY programming language for concurrent, reactive systems
Term Rewriting and All That, Franz Baader and Tobias Nipkow, Cambridge Univ. Press, 1998, 300 pp.
Using Term Rewriting Systems to Design and Verify Processors, Arvind and Xiaowei Shen, IEEE Micro 19:3, 1998, p36-46
Proofs of Correctness of Cache-Coherence Protocols, Stoy et al, in Formal Methods for Increasing Software Productivity, Berlin, Germany, 2001, Springer-Verlag LNCS 2021
Superscalar Processors via Automatic Microarchitecture Transformation, Mieszko Lis, Masters thesis, Dept. of Electrical Eng. and Computer Science, MIT, 2000
… and more …
The intuitions underlying this theory are easy to use in practice
Synthesizing Rules into efficient clocked synchronous HW
- Automatically generates correct HW for the most error-prone parts of hand-written RTL
- While retaining transparency, predictability and designer control
Clocked synchronous hardware
The compiler translates BSV source code into Verilog RTL
[ Figure: clocked synchronous circuit: Transition Logic with inputs I and outputs O reads state S from a Collection of State Elements and produces “Next” S ]
Clocked semantics
Reference semantics:
while some rules are enabled
  choose one enabled rule
  execute it
Clocked semantics:
every clock cycle:
  execute as many rules as you can
  provided the overall effect is as if they executed serially in some order
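The clocked semantics can be illustrated with a deliberately conservative scheduler. In this Python sketch (an assumption for illustration, not the Bluespec compiler's actual conflict analysis), two rules may fire in the same cycle only if neither writes a variable the other reads or writes; under that restriction the combined update equals running the chosen rules serially in any order. `Rule`, `conflicts`, `clock_cycle` and `serial` are hypothetical names.

```python
from collections import namedtuple

# reads/writes are sets of state-variable names; guard and action both see
# the state as it was at the start of the clock cycle.
Rule = namedtuple("Rule", "name guard action reads writes")

def conflicts(a, b):
    """Conservative test: one rule writes something the other reads or writes."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & (a.reads | a.writes))

def clock_cycle(state, rules_by_urgency):
    """Pick enabled, mutually conflict-free rules, most urgent first, and apply them."""
    chosen = []
    for rule in rules_by_urgency:
        if rule.guard(state) and not any(conflicts(rule, c) for c in chosen):
            chosen.append(rule)
    new_state = dict(state)
    for rule in chosen:
        new_state.update(rule.action(state))   # every rule reads the old state
    return new_state, [r.name for r in chosen]

def serial(state, rules):
    """Reference behaviour: run the given rules one at a time, in order."""
    for rule in rules:
        state = {**state, **rule.action(state)}
    return state

proc0 = Rule("proc0", lambda s: s["cond0"],
             lambda s: {"x": s["x"] + 1}, {"cond0", "x"}, {"x"})
proc1 = Rule("proc1", lambda s: s["cond1"],
             lambda s: {"x": s["x"] - 1, "y": s["y"] + 1}, {"cond1", "x", "y"}, {"x", "y"})
proc2 = Rule("proc2", lambda s: s["cond2"],
             lambda s: {"y": s["y"] - 1}, {"cond2", "y"}, {"y"})
```

With all three conditions true and urgency order proc2, proc1, proc0, this scheduler fires proc2 and proc0 together and skips proc1, and the result matches `serial` applied to those two rules in either order.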
Rule semantics mapped to hardware semantics
[ Figure: a “Rules” timeline of rule steps Ri, Rj, Rk mapped onto a “HW” timeline of clocks, with several rule steps falling within one clock ]
The effect of each cycle is as if a sequence of rules was executed one-at-a-time
Consequence: The HW state can never be an interleaving of actions from different rules
  Rule atomicity (therefore, correctness) is preserved
Synthesizing a single rule
rule foo (… cond … (x < y) …);
  … action … x <= x + z …
endrule
[ Figure: current state x, y, z feeds the cond logic and action logic of rule foo; the cond logic drives the enable signals (EN) and the action logic drives the next-state values (D) of the state registers, giving next state x’, y’, z’ ]
Synthesizing multiple rules
Different rules can read/write common state. Therefore,
  Need multiplexing of next state values into shared state element inputs
  Need control of which rules get to update next state elements
    Control of next state “enables”
    Control of next state data multiplexers
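The two control points named above, the register enable and the data multiplexer, can be sketched for a single shared register. This Python model is an illustration under stated assumptions, not generated hardware: each rule that writes the register contributes a (will_fire, next_value) pair, and the scheduler is assumed to have already ensured that at most one of them fires. `register_inputs` and `clock_edge` are hypothetical names.

```python
def register_inputs(candidates):
    """candidates: (will_fire, next_value) pairs for ONE register, one pair per
    rule that writes it. Returns the (enable, data) inputs of that register."""
    firing = [value for will_fire, value in candidates if will_fire]
    assert len(firing) <= 1, "scheduler must not enable two writers at once"
    enable = bool(firing)                   # OR of the will_fire signals
    data = firing[0] if firing else None    # output of the data multiplexer
    return enable, data

def clock_edge(q, enable, data):
    """Register with enable: load data when enabled, otherwise hold q."""
    return data if enable else q
```

For register x in the earlier example, the candidates would be proc1's `x - 1` and proc0's `x + 1`; when neither fires, the enable stays low and x holds its value.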
Synthesizing multiple rules
Scheduler ensures consistency with Rule semantics
Usually the most error-prone part of hand-written RTL
  Here, correct by construction
  Bluespec patented technology
[ Figure: Rule1 … RuleN each provide a Cond and an Action; the Conds feed the Scheduler (Rule Control), which drives the Enable and the Data Select multiplexer that chooses among the Actions for the State register (D, Q) ]
Transparency and predictability
[ Figure: the same scheduler diagram, with the Scheduler/Rule Control annotated “Bluespec synthesis only adds this part” ]
User-specified structures dominate area, critical paths
Microarchitecture remains completely under user control
Comparing BSV to traditional “Behavioral Synthesis”
Function vs. Algorithm
People often say: “I’m describing the algorithm of my HW block using C/C++ or Behavioral RTL”
Actually, they’re describing the function, not the algorithm
  A function: spec of I/O behavior, without consideration for implementability, and in particular without consideration for cost in space (circuitry) or time (performance)
  An algorithm: a specific implementation with a particular cost model
Different computation models, with different cost models, usually require radically different algorithms for implementing the same function
“Behavioral Synthesis”
Past products:
  Synopsys Behavioral Compiler (withdrawn)
  Get2Chip (absorbed into Cadence)
Current products:
  Mentor’s CatapultC
  Synfora
  Forte (in SystemC)
  …
[ Figure: “Behavior” of design expressed as sequential program (e.g., in C or procedural Verilog) → Behavioral Synthesis tool → RTL ]
Behavioral Synthesis: the technology has a long history
[ Figure: compilation flow. Sequential source program → Parsing … → Control-flow graph (sequential CDFG) → Dependency Analysis and associated transforms (“automatic parallelization”) → Parallel CDFG (Control/Data Flow Graph) → Synthesis (target-specific). Targets: Vector computers (~1975 …); VLIW/IA64, Cellular, SIMD, dataflow, SMP, cluster, cache-friendly, … (~1980s …); Hardware (RTL) (~1990s …) ]
Tractable only for certain loop-and-array codes, without any complex control (where it can work spectacularly well)
The “Automatic Parallelization” problem
The input (C program) is totally sequential, because of C semantics
We want the synthesized hardware to exploit parallelism, for high performance
The Automatic Parallelization problem: Undo/remove the input’s sequentiality, converting into a parallel form
“Automatic Parallelization” example: matrix multiplication
/* N is the matrix dimension, assumed to be a compile-time constant */
void matmult (int A[N][N], int B[N][N], int C[N][N])
{
    int i, j, k, innerProductSum;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            innerProductSum = 0;
            for (k = 0; k < N; k++)
                innerProductSum += A[i][k] * B[k][j];
            C[i][j] = innerProductSum;
        }
}
[ Figure: C = A × B; element (i, j) of C is the inner product of row i of A and column j of B ]
“Automatic Parallelization” example: matrix multiplication
Can the k loop (inner product) be executed in parallel?
The “*”s can be done in parallel, but the “+”s are still sequenced
[ Figure: for k = 0 … N-1, the products of A[i,*] and B[*,j] feed a chain of adders that starts from 0 and ends in C[i,j] ]
“Automatic Parallelization” example: matrix multiplication
A clever compiler could transform it into tree accumulation, which has more parallelism
  Depends on commutativity, associativity of “+”
  May not be true if the integers can overflow!
  May not be true for floating point numbers!
[ Figure: for k = 0 … N-1, the products of A[i,*] and B[*,j] are summed by a tree of adders into C[i,j] ]
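The floating-point caveat is easy to demonstrate. The following Python sketch (an illustration, not from the original slides) sums the same four values once as a sequential chain and once as a balanced tree; `sequential_sum` and `tree_sum` are hypothetical names, and the input values are chosen so that rounding in IEEE 754 double precision makes the two orders disagree.

```python
def sequential_sum(values):
    """Left-to-right accumulation, like the k loop as written."""
    total = 0.0
    for v in values:
        total += v
    return total

def tree_sum(values):
    """Balanced tree accumulation: sum each half, then add the halves."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])

# 1e16 + 1.0 rounds back to 1e16 in double precision, so grouping matters.
values = [1e16, 1.0, -1e16, 1.0]
```

Here `sequential_sum(values)` is 1.0 while `tree_sum(values)` is 0.0, so a compiler that silently rewrote the chain into a tree would change the program's result.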
“Automatic Parallelization” example: matrix multiplication
Can the i and j loops be executed in parallel?
  Not as written, because all the k loops read and write a single common variable, “innerProductSum”!
  A clever compiler can eliminate this using “scalar expansion”: converting it into an array
  Note: most clever programmers would do the opposite!
/* N is the matrix dimension, assumed to be a compile-time constant */
void matmult (int A[N][N], int B[N][N], int C[N][N])
{
    int i, j, k, innerProductSum[N][N];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            innerProductSum[i][j] = 0;
            for (k = 0; k < N; k++)
                innerProductSum[i][j] += A[i][k] * B[k][j];
            C[i][j] = innerProductSum[i][j];
        }
}
Automatic Parallelization: history
Studied extensively since the 1960s (vectorizing/parallelizing/VLIW/EPIC software compilers)
Fundamental problems:
  Complex control structures, pointers and aliasing (memory indirection), dynamic data allocation, … are all difficult/impossible to parallelize automatically
  C is often a bad starting point: best parallel algorithm for a given function can be quite different from best sequential algorithm
    Parallel algorithm designers prefer to start with a clean slate from a functional specification, not a C algorithm with unnecessary sequential baggage
Has succeeded only in limited domain: simple array-based loop nests
SW community has abandoned automatic parallelization of general-purpose programs; is mostly used only for scientific/technical computing, linear algebra, …
Automatic Parallelization: transparency, predictability, controllability
Another common issue with automatic parallelization and behavioral synthesis
  Designer loses intuition and precise control over generated output
  Behavioral synthesis: tool decides microarchitecture based on complex optimization criteria
    “What HW will result, with this input C program?”
    “What will be the effect on the resulting HW, if I make this change to the input C program?”
    “What change should I make to the input C program, to improve the HW in this way?”
Behavioral Synthesis: Applicability
[ Figure: the SoC block diagram with blocks classified as Complex Datapaths (e.g. processor/controller), Control, or Technical Algorithms (e.g. DSP/math); listed examples: IDCT, Motion compensator, DES, FIR filter ]
Only a few IP blocks may benefit from Behavioral Synthesis
Comparing “Model of Time” in BSV vs. Automatic Synthesis from C/C++
C/C++: completely untimed
  No relationship between source model of time (sequential C code execution) and target model of time (HW clocks)
BSV: untimed to timed
  Initially, designer writes arbitrarily complex rules, i.e., any amount of functional computation per rule
  Designer refines this (splitting rules, if necessary) so that the functional computation per rule is feasible in HW in a target clock speed/technology
  BSV tool schedules multiple rules per clock
Comparing Concurrency Model in BSV vs. SystemC
BSV: Rules
  Atomic transactions
  Tool generates control logic to manage concurrency
SystemC
  Threads and events
    Higher-level synchronization abstractions built on top of events: semaphores, locks, blocking methods, …
  Designer manages atomicity explicitly (consistent access to multiple shared resources)
Historical improvements in concurrency control
[ Figure: timeline from 1950 to today, with later techniques at a higher level (less error-prone):
  Cycle Accounting
  Semaphores (locks, events, …)
  Atomic objects (structured locking)
  Atomic transactions (multiple resources)
Examples shown: SW: pthreads; HW: RTL, SystemC; SW: Java; HW: Bluespec; SW: Database Systems, Distributed Systems ]
Elevating design above RTL
[ Table, partly recoverable. Columns: CORRESPONDENCE/TRANSPARENCY TO HARDWARE STRUCTURE; CONCURRENCY/COORDINATION; COMMUNICATION
  Bluespec: Explicit (<LOC); Rules with Methods (<LOC)
  SystemC, C/C++: Functionality with tool determined structure and resources, but only for simple array-based FOR loops (for SystemC: anything else is explicit) (<LOC); Explicitly Managed; Wires (<LOC); N/A
  RTL: Explicit; Explicitly Managed; Wires ]
<LOC = Fewer Lines of Code than RTL
Consider a FIFO, in RTL
In Verilog:
module mkFIFO_model (output notFull,
                     input [31:0] dataIn,
                     input enq_enab,
                     output notEmpty,
                     output [31:0] first,
                     input deq_enab);
  …
endmodule
module mkFIFO_implem (output notFull,
                      input [31:0] dataIn,
                      input enq_enab,
                      output notEmpty,
                      output [31:0] first,
                      input deq_enab);
  …
endmodule
[ Figure: mkFIFO block; enq() side with ports dataIn (32 bits), enq_enab, notFull; first()/deq() side with ports first (32 bits), deq_enab, notEmpty ]
Modules written to be used by others require detailed specifications
A small sample of the informal, written interface specification (8 pages):
[ Figure: “Designware” FIFO and associated documentation; ports data_in, data_out, push_req_n, pop_req_n, clk, rstn, full, empty ]
Module interfaces: summary critique of today’s RTL methodology
Two modules that implement the same interface have to repeat the same port list (tedious, error prone)
Interfaces are flat, unstructured port lists
  No concept of grouping ports according to “transactions”
No specification of behavior on the interface
  “enq_enab allowed only if notFull”
  “data_in should be valid with enq_enab”
  “first only valid if notEmpty”
  “deq_enab allowed only if notEmpty”
Behavior is typically specified in ad hoc text and timing diagrams
  Verification obligation, often to incomplete specs
A FIFO in SystemVerilog
In SystemVerilog:
interface FIFO;
  bit notFull, enq_enab;
  bit [31:0] dataIn;
  bit notEmpty, deq_enab;
  bit [31:0] first;
  modport ifc (output notFull, notEmpty, first,
               input dataIn, enq_enab, deq_enab);
endinterface
module mkFIFO_model (FIFO.ifc);
  …
endmodule
module mkFIFO_implem (FIFO.ifc);
  …
endmodule
[ Figure: mkFIFO block; ports dataIn (32 bits), enq_enab, notFull grouped as enq; first (32 bits), deq_enab, notEmpty grouped as first and deq ]

Module interfaces: summary critique of SystemVerilog methodology
- Interface port lists are separately specified (independent of any module implementing the interface)
- Two modules that implement the same interface can share the same interface definition (improves “plug and play”)
- But, still:
  - Interfaces are flat, unstructured port lists
  - No concept of grouping ports according to “transactions”
  - No specification of behavior on the interface
    - “enq_enab allowed only if notFull”
    - “dataIn should be valid with enq_enab”
    - “first only valid if notEmpty”
    - “deq_enab allowed only if notEmpty”
  - Behavior is typically specified in ad hoc text and timing diagrams
  - Verification obligation, often to incomplete specs
Note: SV does allow definition of tasks and functions inside an interface definition, and this provides some limited ability to group according to transactions and to encapsulate interface behavior

Rule-based Interfaces
Robust, parameterizable, correct-by-construction way to express interactions with a module
Extend Rule Semantics across module boundaries
Capture the protocol of a complete “transaction” with a module
Capture inter-transaction scheduling constraints

A FIFO in BSV
interface FIFO#(type itemType);
  method Action   enq (itemType x);
  method itemType first ();
  method Action   deq ();
  method Action   clear ();
endinterface
Each method captures a complete transaction protocol:
- RDY: e.g., enq() is allowed (the FIFO is not full); e.g., deq() is allowed (the FIFO is not empty)
- ENABLE: e.g., when enq() or deq() is invoked
- Input data buses (method arguments)
- Output data buses (method results)
More abstract than port lists and ad hoc timing diagrams
Never have any timing errors at interfaces

Methods map directly into HW ports: FIFO
[Figure: any module that provides a FIFO interface, showing the ports generated for each method]
enq():
• n-bit argument
• has side effect (Action)
• ports: n-bit data, enab, rdy (not full)
first():
• no argument
• n-bit result
• ports: n-bit data, rdy (not empty)
deq():
• no argument
• has side effect (Action)
• ports: enab, rdy (not empty)
clear():
• no argument
• has side effect (Action)
• ports: enab, rdy (always true)

Interface methods are HW!
- Interface method declarations look like functions/procedures in SW
- Uses of interface methods look like function/procedure calls in SW
But: think HW, not SW or process simulation!
- A definition of an interface method in a module is a manifest bit of circuitry behind its ports
- A use of an interface method is just a set of connections (wires) to the module interface ports
- There is no “call/execute/return”, stack frame, …!

Interface methods fit smoothly into rules
module …
  FIFO#(int) iFifo  <- mkFIFO;
  FIFO#(int) oFifo1 <- mkFIFO;
  FIFO#(int) oFifo2 <- mkFIFO;

  rule (iFifo.first[0] == 0);
    iFifo.deq;
    oFifo1.enq (iFifo.first);
  endrule

  rule (iFifo.first[0] == 1);
    iFifo.deq;
    oFifo2.enq (iFifo.first);
  endrule
endmodule
[Figure: “route” block steering iFifo into oFifo1 or oFifo2]
All the implicit conditions (notFull, notEmpty) are automatically handled by incorporating them into Rule conditions. This eliminates much clutter, and improves correctness.
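The folding of implicit conditions into rule conditions can be pictured with a small Python model. This is only an illustrative sketch, not how the Bluespec compiler works: `GuardedFifo`, `can_enq`, `can_deq`, and `route_step` are hypothetical names, and the model evaluates one routing step at a time.

```python
from collections import deque

class GuardedFifo:
    """Toy FIFO whose methods carry an explicit 'ready' guard (hypothetical model)."""
    def __init__(self, capacity=2):
        self.items = deque()
        self.capacity = capacity

    def can_enq(self):              # models the RDY of enq: not full
        return len(self.items) < self.capacity

    def can_deq(self):              # models the RDY of first/deq: not empty
        return len(self.items) > 0

    def first(self):
        return self.items[0]

    def enq(self, x):
        assert self.can_enq(), "enq on a full FIFO"
        self.items.append(x)

    def deq(self):
        assert self.can_deq(), "deq on an empty FIFO"
        self.items.popleft()

def route_step(i_fifo, o_fifo1, o_fifo2):
    """Fire the routing rule only if its explicit and implicit conditions all hold.
    Returns True if a rule fired."""
    if not i_fifo.can_deq():                        # implicit condition of first/deq
        return False
    x = i_fifo.first()
    target = o_fifo1 if x & 1 == 0 else o_fifo2     # explicit condition: low bit
    if not target.can_enq():                        # implicit condition of enq
        return False
    i_fifo.deq()
    target.enq(x)
    return True
```

The caller never checks fullness or emptiness itself; in this model a step simply does nothing when a guard is false, which mirrors the slide's point that the rule text stays free of that bookkeeping.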

Module interfaces: Inter-transaction scheduling constraints
Architect to Engineers: “Please design for me a FIFO in which I can enq and deq simultaneously (i.e., in the same clock)”
Engineer 1 (“naiveFIFO”): “With my FIFO, you can enq and deq simultaneously in most cases, but not if it’s either empty or full.”
Engineer 2 (“PipelineFIFO”): “With my FIFO, you can enq and deq simultaneously even if it’s full. (Think of it as a deq first, making room for a following enq, but squeezed into a single clock. This naturally fits into register semantics: read old value, write new value.)”
Engineer 3 (“BypassFIFO”): “With my FIFO, you can enq and deq simultaneously even if it’s empty. (Think of it as an enq first, making an item available for a following deq, but squeezed into a single clock. This is just a bypass of a value from input to output.)”

Inter-transaction scheduling constraints
For 3 FIFO designs (capacity 2) and various conditions, allowable operations (enq(), deq()) and their “in the same clock” semantics:

# of elements in FIFO    0            1             2
NaïveFIFO                enq          enq || deq    deq
PipelineFIFO             enq          enq || deq    deq < enq
BypassFIFO               enq < deq    enq || deq    deq
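The table can be reproduced from the two guards of each variant. The sketch below is a hedged Python model rather than BSV; `can_fire` and `table_entry` are hypothetical helper names, and the model assumes both enq and deq are requested in the same clock.

```python
def can_fire(variant, count, capacity=2):
    """Return (can_enq, can_deq) for a clock in which both methods are requested."""
    not_full, not_empty = count < capacity, count > 0
    if variant == "naive":
        return not_full, not_empty
    if variant == "pipeline":       # deq < enq: the deq frees a slot for the enq
        return not_full or not_empty, not_empty
    if variant == "bypass":         # enq < deq: the enq supplies an item for the deq
        return not_full, not_empty or not_full
    raise ValueError(variant)

def table_entry(variant, count, capacity=2):
    """Render one cell of the scheduling table."""
    can_enq, can_deq = can_fire(variant, count, capacity)
    if can_enq and can_deq:
        if count == capacity:
            return "deq < enq"      # only possible because deq logically goes first
        if count == 0:
            return "enq < deq"      # only possible because enq logically goes first
        return "enq || deq"
    return "enq" if can_enq else "deq"

for variant in ("naive", "pipeline", "bypass"):
    print(variant, [table_entry(variant, n) for n in range(3)])
```

The only cells that differ between variants are the empty and full columns, which is where the PipelineFIFO's notFull comes to depend on deq and the BypassFIFO's notEmpty on enq.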

Module interfaces: Inter-transaction scheduling constraints
- The FIFO variants have the same interface methods/wires, but differ only in scheduling of the interface transactions: “enq || deq”, “deq < enq”, “enq < deq”
- They have different latency properties
  - NaïveFIFO, PipelineFIFO: minimum 1-tick latency
  - BypassFIFO: minimum 0-tick latency
  - This can affect “alignment” with associated data on other datapaths
- Their control circuits have different properties:
  - PipelineFIFO: “notFull” depends on “deq_enab”
  - BypassFIFO: “notEmpty” depends on “enq_enab”
- Their data paths have different properties:
  - BypassFIFO: combinational path from data in to data out (can affect timing closure)

Module interfaces: Inter-transaction scheduling constraints
- “Client” HW that uses one of these FIFOs will be different, depending on which variant is used
  - Different control logic to obey different scheduling requirements
- In RTL,
  - These differences are often undocumented, or poorly documented, or poorly communicated from FIFO designer to FIFO user
  - The result: more verification surprises, bugs
- With Rule-based Interface Methods
  - Precise vocabulary to specify and communicate scheduling
  - Control HW in client is automatically synthesized to take into account scheduling differences

Broad-brush differences between BSV and RTL:
Module hierarchy
- BSV has exactly the same notion of module hierarchy as RTL
  - In fact, more stringently so: even registers are modules (at the leaves of the hierarchy). In BSV, ordinary variables never represent registers.
  - Thus, designers exercise precise control over microarchitecture
- “If so, how can BSV be a high-level HDL?”
  - Microarchitecture is the creative (and fun) part of HW design; it distinguishes good designs from bad. The designer should remain involved in this.
  - Complex concurrency and control is the hard and tedious part of HW design; it’s where most errors arise. BSV’s Rules dramatically simplify and automate this.

Modules, rules, interfaces, methods
The big picture: modules contain rules which use methods that are provided by sub-modules in their interfaces. Methods, too, can use other methods.
[Figure: nested modules, each containing state and rules behind an interface]

Example: a 2x2 switch, with stats
- Packets arrive on two input FIFOs, and must be switched to two output FIFOs
- Certain “interesting packets” must be counted
[Figure: two input FIFOs, each feeding a “Determine Queue” block that steers packets to the two output FIFOs, plus a “+1” counter labelled “Count certain packets”]

2x2 switch specs
- Input FIFOs can be empty
- Output FIFOs can be full
- Shared resource collision on an output FIFO: if packets available on both input FIFOs, both have same destination, and destination FIFO is not full
- Shared resource collision on counter: if packets available on both input FIFOs, each has different destination, both output FIFOs are not full, and both packets are “interesting”
- Resolve collisions in favor of packets from the first input FIFO
- Must have maximum throughput: a packet must move if it can, modulo the above rules

The meat of the BSV code
[Figure: the 2x2 switch diagram again: two “Determine Queue” blocks and the “+1” counter (“Count certain packets”)]
module mkSmallSwitch (…);
  …
  (* descending_urgency = "r1, r2" *)

  rule r1;   // for packets from FIFO i1
    let x = i1.first;
    let out = ((x[0] == 0) ? o1 : o2);
    i1.deq;
    out.enq (x);
    if (count(x)) c <= c + 1;
  endrule

  rule r2;   // for packets from FIFO i2
    let x = i2.first;
    let out = ((x[0] == 0) ? o1 : o2);
    i2.deq;
    out.enq (x);
    if (count(x)) c <= c + 1;
  endrule
endmodule: mkSmallSwitch

Commentary
- Muxing into output FIFOs, and control of those muxes, automatically generated
- Automatic handling of FIFO emptiness, FIFO fullness
  - This is part of BSV’s rule and interface method semantics
  - Impossible to read a junk value from an empty FIFO
  - Impossible to enqueue into a full FIFO
  - Impossible to race for multiple enqueues onto a FIFO
- All control for resource sharing handled automatically
  - Rule atomicity ensures consistency
  - The “descending_urgency” attribute resolves collisions in favor of rule r1
- The BSV code directly expresses design intent without all the clutter of control and shared-resource management, generating efficient, correct-by-construction RTL
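To make the collision rules of the switch spec concrete, here is a hedged Python model of one clock of the 2x2 switch. It follows the written spec (output-FIFO collision, counter collision, first input wins), not the control logic the Bluespec compiler would actually generate; `interesting`, `switch_clock`, and the capacity of 2 are assumptions for illustration.

```python
from collections import deque

def interesting(pkt):
    """Hypothetical stand-in for count(x) on the slide: which packets get counted."""
    return pkt >= 0x80

def switch_clock(i1, i2, o1, o2, counter, capacity=2):
    """Model one clock of the 2x2 switch and return the updated counter."""
    outs = (o1, o2)

    def ready(inp):
        # Implicit conditions: input not empty and destination not full.
        return bool(inp) and len(outs[inp[0] & 1]) < capacity

    fire1, fire2 = ready(i1), ready(i2)     # both guards read start-of-clock state
    if fire1 and fire2:
        same_output = (i1[0] & 1) == (i2[0] & 1)
        both_counted = interesting(i1[0]) and interesting(i2[0])
        if same_output or both_counted:
            fire2 = False                   # descending_urgency = "r1, r2"

    for inp, fire in ((i1, fire1), (i2, fire2)):
        if fire:
            pkt = inp.popleft()
            outs[pkt & 1].append(pkt)
            if interesting(pkt):
                counter += 1
    return counter
```

When the two packets go to different outputs and at most one is counted, both move in the same clock, which is the "maximum throughput" requirement from the spec.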

Managing change
Now imagine the following changes to the existing code:
- Some packets are multicast (go to both FIFOs)
- Some packets are dropped (go to no FIFO)
- More complex arbitration
  - FIFO collision: in favor of r1
  - Counter collision: in favor of r2
  - Fair scheduling
- Several counters for several kinds of interesting packets
- Non-exclusive counters (e.g., IP packets include TCP packets)
- M input FIFOs, N output FIFOs (parameterized)
Suppose these changes are required 6 months after original coding
In BSV these are easy, because
- the source code remains uncluttered by all the complex control and mux logic
- atomicity ensures correctness

Broad-brush differences between BSV and RTL:
BSV is not simulation-centric
- RTL and SystemC are simulation-centric
- “Synthesizable subsets” were defined later
- Many concepts/constructs are a consequence of this SW-process-like simulation view. E.g.,
  - Execution of a process has a program-counter-like “locus of control”
  - Variables have the semantics of updatable memory locations, updated when “execution reaches this statement”
  - Sensitivity lists
  - “If execution reaches this statement, the wire is driven with the value of the right-hand side”
  - Functions/procedures get called, execute, and return (stack-like semantics)
- None of these are particularly meaningful from a HW point of view: the tail (simulation) is wagging the dog (HW description)
- BSV is not simulation-centric, and in these respects, BSV is closer to the traditional HW view

Broad-brush differences between BSV and RTL:
Datapaths and control paths
- With BSV you don’t think separately about datapaths and control
- Each Rule specifies the part of the datapath relevant for its behavior, and the control conditions under which the path is traversed
- The Bluespec compiler combines these specifications to generate the final datapaths and control circuitry
  - No central datapath description
  - No central “control FSM”

Interface abstraction
Examples of BSV hierarchical and polymorphic interfaces (all synthesizable):
interface Put#(type t);
  method Action put(t x);
endinterface

interface Get#(type t);
  …
endinterface

interface Client#(type reqType, type respType);
  interface Get#(reqType)  request;
  interface Put#(respType) response;
endinterface

interface Server#(type reqType, type respType);
  interface Put#(reqType)  request;
  interface Get#(respType) response;
endinterface

interface DMA#(type busReq, type busResp);
  interface Client#(busReq, busResp) dataMover;
  interface Server#(busReq, busResp) config;
endinterface

Client/Server interfaces
Get/Put pairs are very common, and duals of each other, so the library defines Client/Server interface types for this purpose
interface Client #(type req_t, type resp_t);
  interface Get#(req_t)  request;
  interface Put#(resp_t) response;
endinterface

interface Server #(type req_t, type resp_t);
  interface Put#(req_t)  request;
  interface Get#(resp_t) response;
endinterface
[Figure: a client and a server joined by two links: the client’s get to the server’s put carries req_t, and the server’s get to the client’s put carries resp_t; each link has data, ready, and enable wires]

Client/Server interfaces
interface CacheIfc;
  interface Server#(Req_t, Resp_t) ipc;
  interface Client#(Req_t, Resp_t) icm;
endinterface

module mkCache (CacheIfc);
  // from / to processor
  FIFO#(Req_t)  p2c <- mkFIFO;
  FIFO#(Resp_t) c2p <- mkFIFO;

  // to / from memory
  FIFO#(Req_t)  c2m <- mkFIFO;
  FIFO#(Resp_t) m2c <- mkFIFO;

  … rules expressing cache logic …

  interface ipc = fifosToServer (p2c, c2p);
  interface icm = fifosToClient (c2m, m2c);
endmodule
[Figure: mkProcessor (client) connected to the server side of mkCache; the client side of mkCache connected to mkMem (server); each connection is a get/put pair in each direction]

mkConnection
Using these interface facilities, assembling systems becomes very easy
interface CacheIfc;
  interface Server#(Req_t, Resp_t) ipc;
  interface Client#(Req_t, Resp_t) icm;
endinterface

module mkTopLevel (…);
  // instantiate subsystems
  Client #(Req_t, Resp_t) p <- mkProcessor;
  CacheIfc                c <- mkCache;
  Server #(Req_t, Resp_t) m <- mkMem;

  // instantiate connections
  mkConnection (p, c.ipc);
  mkConnection (c.icm, m);
endmodule
[Figure: mkProcessor (client) connected to mkCache’s server (ipc); mkCache’s client (icm) connected to mkMem (server)]
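The duality between Get and Put, and what a connection between them amounts to, can be sketched in Python. This is an assumption-laden model, not the BSV library: `Fifo`, `connect`, and `connect_client_server` are hypothetical names, and a "rule" is modelled as a function that moves one item when both sides are ready.

```python
from collections import deque

class Fifo:
    """Toy FIFO exposing both a Put side and a Get side (hypothetical model)."""
    def __init__(self, capacity=2):
        self.items, self.capacity = deque(), capacity

    # Put side
    def can_put(self):
        return len(self.items) < self.capacity

    def put(self, x):
        self.items.append(x)

    # Get side
    def can_get(self):
        return bool(self.items)

    def get(self):
        return self.items.popleft()

def connect(getter, putter):
    """Return a 'rule' that transfers one item when the get and the put are both ready."""
    def rule():
        if getter.can_get() and putter.can_put():
            putter.put(getter.get())
            return True
        return False
    return rule

def connect_client_server(client, server):
    """client = (request Get, response Put); server = (request Put, response Get)."""
    client_req, client_resp = client
    server_req, server_resp = server
    return [connect(client_req, server_req), connect(server_resp, client_resp)]
```

A usage sketch: build two FIFOs per side, call `connect_client_server`, and invoke the returned rules once per modelled clock; requests drain toward the server and responses toward the client without any explicit ready/enable handling at the call site.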

Bluespec SystemVerilog Agenda: Technical Deep Dive
- Intro: why an HDL can affect overall productivity, from concept to silicon
- Behavior:
  - Rules: a new way to express HW behavior
  - Correctness: why rules help
  - Comparison with behavioral synthesis
  - Rule-based Interface Methods: modularizing rules
- Structure: improving the expression of HW structure using ideas from advanced programming languages
- Clock domains and gated clocks: compiler-guaranteed safety
- Testbenches using BSV
- Transaction Level Modeling/architecture exploration and refinement, within a single paradigm
  - Comparison with SystemC
- Synthesis quality: as good as hand-coded RTL
- Tool flows
  - Coexistence with Verilog/VHDL/SV/SystemC
- Futures:
  - Integration of Rules and Rule-based Interfaces into SystemC
  - Formal verification

Structural abstractions
The behavioral abstractions (Rules and Interface Methods), by themselves, tremendously improve productivity and correctness
A designer can be productive with Rules and Interface Methods after about a day of training
The structural abstractions (types, parameterization, static checking, elaboration) are an additional substantial multiplier

Example: a butterfly switch (crossbar)
Basic building blocks:
Recursive construction: 1x1, 2x2, 4x4, …, NxN
[Figure: butterfly switch built recursively from smaller switches, with ports labelled 00, 01, 10, 11]

Butterfly switch: code excerpts
- Polymorphic (type parameter t)
- Sub-interfaces (hierarchical)
- Aggregation (lists, vectors of interfaces)
interface XBar #(type t);
  interface List#(Put#(t)) input_ports;
  interface List#(Get#(t)) output_ports;
endinterface

Butterfly switch: code excerpts
- Size parameter: logn
- Comb. circuit parameter: destinationOf
- Module parameter: mkMerge2x1
  - Encapsulates flow-control, arbitration, queueing behavior of the 2x1 merge
- Interfaces instead of port lists: XBar#(t)
- Polymorphic: type parameter t
module mkXBar #(Integer logn,                                // param
                function Bit #(32) destinationOf (t x),      // param
                module #(Merge2x1 #(t)) mkMerge2x1)          // param
               (XBar #(t));                                  // interface
  …
endmodule: mkXBar

Butterfly switch: code excerpts
- Arbitrary elaboration (here: conditional, recursion, loop)
- All constructs can be elaborated (first class modules, interfaces, rules, …)
module mkXBar #(Integer logn, …)
  if (logn == 0) …            // BASE CASE
    FIFO#(t) f <- mkFIFO;
    …
  else …                      // RECURSIVE CASE
    XBar#(t) upper <- mkXBar (logn-1, …);
    XBar#(t) lower <- mkXBar (logn-1, …);
    …
    for (Integer j = 0; j < n; j = j + 1) …
      rule route;
        …
        if (! flip) merges [j]       .iport0.put (x);
        else        merges [jFlipped].iport1.put (x);
      endrule
endmodule: mkXBar
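The recursive structure in the excerpt can be mimicked in Python to see what static elaboration unrolls. This is a sketch under assumptions, not the whitepaper's code: `count_merges` and `route` are hypothetical helpers, the destination is taken to be the output port number, and the top destination bit decides the flip at the outermost stage.

```python
def count_merges(logn):
    """Number of 2x1 merge blocks instantiated for a switch with 2**logn ports."""
    if logn == 0:
        return 0                              # base case: a single FIFO, no merge
    n = 2 ** logn
    return 2 * count_merges(logn - 1) + n     # upper and lower halves, then n merges

def route(logn, src, dst):
    """Return the port position a packet occupies after each stage, ending at dst."""
    if logn == 0:
        return [0]
    half = 2 ** (logn - 1)
    sub = src // half                          # 0 = upper sub-switch, 1 = lower
    # The sub-switch routes on the low-order part of the destination.
    path = [sub * half + p for p in route(logn - 1, src % half, dst % half)]
    j = path[-1] % half                        # output index within the sub-switch
    flip = (dst // half) != sub                # packet must cross to the other half
    target_half = (1 - sub) if flip else sub
    return path + [target_half * half + j]

print(count_merges(3))    # merges in an 8x8 switch
print(route(3, 5, 2))     # positions visited from input 5 to output 2
```

The merge count grows as N·log2(N), while the source text stays a single recursive definition; that gap is what the slide means by elaboration reducing code size.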

Butterfly switch
(see also whitepaper and/or demo for full code)
Summary:
- Advanced parameterization
- Recursive elaboration
- The switch itself: < 60 lines of BSV code
- First working (tested) prototype: < 1 day (including simple testbench)
- Fully synthesizable:
  - synthesized to netlist (Magma, TSMC 0.18u, 500 MHz)

Example: parameterized, pipelined, priority queue (P3Q)
[Figure: queue with enq (insertion point depends on “priority”) and deq]
Specs:
- Must be synthesizable to quality HW
- Must allow simultaneous (same clock) enq/deq
- Must be parameterized with:
  - Capacity of queue
  - Item-type (data type of items being queued)
  - Precise bit-representation of item-type
  - Priority function (“item1 <= item2”)
  - Pipelined (2-clock) or non-pipelined (1-clock) enq op (to allow synthesis at range of clock speeds)
  - Pipelining should not affect external enq-to-deq latency
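The functional part of these specs (capacity, item type, and priority function as parameters) can be captured in a short Python reference model. It is a hedged sketch only: `P3QModel` and its method names are hypothetical, and it deliberately ignores the pipelining, bit-representation, and same-clock enq/deq requirements, which are hardware concerns.

```python
class P3QModel:
    """Functional reference model of a priority queue (no timing, no pipelining)."""
    def __init__(self, capacity, le):
        self.capacity = capacity
        self.le = le                  # priority function: le(item1, item2)
        self.items = []               # kept in priority order, head at index 0

    def can_enq(self):
        return len(self.items) < self.capacity

    def can_deq(self):
        return bool(self.items)

    def enq(self, x):
        assert self.can_enq(), "enq on a full queue"
        pos = 0
        # Skip every item that goes before x (ties keep arrival order).
        while pos < len(self.items) and self.le(self.items[pos], x):
            pos += 1
        self.items.insert(pos, x)

    def deq(self):
        assert self.can_deq(), "deq on an empty queue"
        return self.items.pop(0)
```

A model like this could serve as the checker in a testbench: feed the same enq/deq stream to the hardware and to the model and compare what comes out.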

P3Q in Bluespec SystemVerilog
- Written, tested, synthesized in ~ 3 days
- About 610 lines of understandable, well-commented code (~ 400 lines if comments are ignored)
- Synthesized at 400 MHz (Magma, TSMC 0.18u) (see white paper)
- Compares very well with solutions in any other SW programming language or HDL!
Quote from expert commercial architect/designer who specified this problem: “I expect this to be a 10X improvement over what we do today”

Advanced clock management
- Clock domains:
  - Clock abstract type, with static checking of clock compatibility
  - So, impossible to connect across clock domains without a synchronizer
  - Rich, user-extensible library of synchronizers
- Gated clocks, for power management
  - Clock gating conditions contribute to Rule conditions
  - So, impossible to communicate with a clock domain that is gated “off”

Power management: Multiple clock domains
One of the most effective ways to control power consumption
Divide the design into “islands” or “domains” that use a common clocking discipline
Run each domain at the slowest clock speed that is adequate to meet performance specs
“Gate”-off clocks to domains that are currently not being used
E.g., digital camera circuits in a cell phone when the camera is not in use

Multiple clock domains: Typical design rules
- Always use a “synchronizer” at domain boundaries
  - Unless the two clocks only differ in gating (same underlying “oscillator”)
- Do not communicate with a gated-off domain
  - But you may still need to read “most recent values” before the clock was gated off
- “Ignore” timing violations in synchronizers
  - By definition they violate clock timing discipline
  - “False paths” in synthesis constraints

Multiple clock domains: enforcing design rules in BSV
- BSV treats Clock as a special abstract data type
  - Distinguished from all other types
  - Type-checking ensures that clocks never get mixed up with ordinary signals
  - For clock dividers, BSV provides only “trusted” primitives for deriving the divided clock from an existing clock
  - For clock generation, BSV provides only “trusted” primitives for elevating an ordinary signal into a Clock
- Clocks can be used in expressions, parameters, arguments, arrays, …; type-checking ensures safety
Clock c1;
Clock c = (b ? c1 : c2);   // b must be known at compile-time

Multiple clock domains: enforcing design rules in the design language
- BSV provides primitives to associate a boolean signal with a Clock, as a gating signal
Bool b1 = …;
Bool b2 = …;
Clock c1 <- mkGatedClock (b1, clocked_by c0);
Clock c2 <- mkGatedClock (b2, clocked_by c1);
- New gating signals are “ANDed” with existing gating signals
- The compiler keeps track of the fact that c0, c1 and c2 differ only in gating signals (have a common oscillator)
  - c0, c1 and c2 are said to be “in the same clock family”

Multiple clock domains: enforcing design rules in the design language
- When instantiating a module, can connect Clocks as usual
  - Type-checking ensures that only a Clock signal can be connected to a Clock port
- Statically checked rules also
  - ensure that each Rule-based Interface Method of the instantiated module is clocked with a unique Clock
  - keep track of which method is clocked by which Clock
IfcType ifc <- mkModule (…, c1, clocked_by c0)
… ifc.method_A () …   // clocked by c1
… ifc.method_B () …   // clocked by c0

Multiple clock domains: enforcing design rules in the design language
- In every Rule, type-checking ensures that all the methods used in the rule have a “compatible” clock (same clock family)
rule foo (5 < mod1.method1());
  let x = mod2.method2 (True);
  mod3.method3 (x, x+1);
endrule
- mod1.method1, mod2.method2 and mod3.method3 must have the same clock (or be in the same family)
  - If not, a static error is raised by the compiler
  - If, e.g., mod1.method1 has a different clock, the designer must insert a synchronizer module between mod1.method1 and its use in this rule, to resolve the incompatibility

Multiple clock domains: enforcing design rules in the design language
- In every Rule, clock gating conditions are “ANDed” with the rule condition
rule foo (5 < mod1.method1());
  let x = mod2.method2 (True);
  mod3.method3 (x, x+1);
endrule
- The rule will not execute if any of the clocks of any of the methods is gated off
- Therefore, will not attempt to communicate with a method that is gated off
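The two checks described on these slides, a static clock-family check and a dynamic gating condition, can be illustrated with a small Python model. It is a sketch under assumptions rather than the compiler's algorithm: `Clock`, `gated`, `same_family`, `check_rule`, and `rule_enabled` are hypothetical names, and a clock family is identified simply by its oscillator name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clock:
    oscillator: str        # identifies the underlying oscillator (the family)
    gates: tuple = ()      # names of the gating signals ANDed onto this clock

def gated(clock, gate_name):
    """Derive a clock in the same family with one more gating signal."""
    return Clock(clock.oscillator, clock.gates + (gate_name,))

def same_family(*clocks):
    return len({c.oscillator for c in clocks}) == 1

def check_rule(method_clocks):
    """Static check: every method used by a rule must be in one clock family."""
    if not same_family(*method_clocks):
        raise TypeError("clock-domain crossing without a synchronizer")

def rule_enabled(rule_condition, method_clocks, gate_values):
    """Dynamic condition: the rule condition ANDed with every gating signal."""
    check_rule(method_clocks)
    all_gates_on = all(gate_values[g] for c in method_clocks for g in c.gates)
    return rule_condition and all_gates_on
```

In this model a rule touching methods on c0, c1, and c2 passes the static check because all three share an oscillator, but it is enabled only while both gating signals are true.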

Power management: Multiple clock domains (summary)
- Today’s SoCs have numerous clock domains:
  - Different IP blocks run at different clock speeds
  - For power management
- Abstract types, type checking, and clock tracking can eliminate many of the common errors made by designers in managing multiple clock domains
  - Clean clocks: cannot accidentally use a (possibly skewed) signal for a clock
  - Cannot accidentally connect across clock domain boundaries of unrelated clocks without using a synchronizer
  - Cannot accidentally communicate with a module whose clock is currently gated off

Example: USB 2.0 UTMI
[Figure: a USB Host connected over USB 2.0 to a USB Device, which comprises the USB 2.0 Transceiver Macrocell (UTMI), the Serial Interface Engine, and Device Specific Logic. Source: UTMI specification, version 1.05]
USB PHY, includes:
• Data serialization/deserialization
• Bit stuffing
• Clock recovery and synchronization
- Including 480 Mbps serial mode

UTMI Implementation
[Figure: BSV implementation of the USB 2.0 Transceiver Macrocell. Blocks: Analog Front End; Physical Interface (480/12 MHz) with Oversampler and PhyOut; Receiver (120 MHz); Transmitter (120 MHz); Serial Interface Engine (SIE) (30 MHz), exchanging 16-bit Receive Word and Transmit Word. Clocks: 480 MHz Input Clocks (8), 480 MHz Input Clock, 12 MHz Generated Clock.]
13 Clock Domains!

UTMI implementation notes
- Developed by one engineer in 3 months
- Verified with Cadence eVC testbench
- Transmitter & receiver are separable components
- Synthesizes at 480 MHz in TSMC 0.18 using Magma with positive slack
Absolutely no runtime clock debugging!

About Reuse
IP Reuse has traditionally been difficult because of
- inflexibility: IP block can’t be “adjusted” for different application
- imprecision: Undocumented scheduling/protocol assumptions
All the language-based ideas we have discussed improve the situation:
- Rules and Rule-based Interface Methods
  - Express complex concurrency across shared resources succinctly and naturally
  - Eliminate typical control-logic design errors, including race-conditions, by automatically synthesizing the correct control logic
- Types, type-checking and clock-checking eliminate careless mistakes by designers
- Polymorphism and parameterization allow defining generic IP blocks that can be instantiated in widely differing contexts
- Full-power static elaboration allows very succinct expression of regular structures, dramatically reducing code size, and eliminating tedium and careless mistakes in “cut-and-paste” manual replication

Bluespec SystemVerilog for Testbenches

Verification is still a bottleneck
- TB complexity grows along with exploding complexity in DUTs
  - Complex TB behaviors (simultaneous stimulus on multiple ports, pipelining, out-of-order processing)
  - Mixing new and old IPs in SoCs
  - Inadequate facilities to construct libraries of common TB design patterns
- Inadequate interface semantics
  - Complex data types
  - Complex interface protocols
  - Difficult to refine from TLM to Implementation Level
- Limited Parameterization and therefore reuse of Verification IPs, Transactors, etc.
Bluespec’s strengths can remove these bottlenecks

BSV improves verification: for the Testbench
Testbenches enjoy the same benefits:
- Express complex concurrency correctly with Rules
  - State-machine generation
  - Succinct expression of stimulus patterns
- Correct connection to DUT
  - Interface Methods are naturally transactional
  - Interface abstraction allows high-level interfaces
  - No interface timing errors
  - Clock discipline
- Reuse due to parameterization

State machine generation
- Easy to specify precise orchestration of stimulus: sequencing, parallel, iteration
- Same Rule semantics: automatically flow-controlled, robust to latency variations, etc.
// Specify an FSM generating a test sequence
Stmt test_seq =
  seq
    for (i <= 0; i < NI; i <= i + 1)           // each input
      for (j <= 0; j < NJ; j <= j + 1) begin   // each output
        let pkt <- gen_packet ();
        send_packet (i, j, pkt);               // test i-j path in isolation
      end
    par   // test packet arbitration by sending packets in parallel
      send_packet (0, 1, pkt0);                // to output 1
      send_packet (1, 1, pkt1);                // to output 1 (collision)
    endpar
  endseq;
// Generate the FSM
mkAutoFSM (test_seq);

Example:
An Ethernet MAC testbench is created that corresponds to an existing SV TB. The testbench is quickly extended to create a switch for more real-life testing, at a fraction of the effort it would take to write and debug the original SV TB

MAC Testbench Structure
[Figure: DUT (MAC) with Slave WB IFC, Master WB IFC, MII Interface, and Interrupts. Around it: PHY with Frame Source and Sink on the MII Interface; RAM (Slave WB IFC); SWEM Software Emulator (Master WB IFC). Paths labelled “Test DUT Receiving Packets” and “Test DUT Transmitting Packets”. Legend: Bluespec, Verilog95.]

Adding concurrency
- Original SV Tb: Untimed Tb
  - ~7000 lines of code
  - No concurrency management
  - Stand-alone checking
- New Tb: Timed Tb
  - ~2600 lines of code
  - Generalized Wishbone Model
  - Includes infrastructure to handle concurrency
- Generalized Switch: Router Environment
  - Parameterized
  - Verification Environment With Concurrency

Extended Example
- Combine DUTs into router/switch
  - Multiple DUTs
  - Packet Routing across Wishbone bus
  - Wishbone now includes round-robin arbiter.
- Little additional code required
  - Wishbone bus etc. already generalized
  - Instantiate multiple DUTs
  - Add Arbiter/Bank
  - Add serialization code (Frame -> WB)
- Original: ~2583 relevant lines
- Modified: ~2957 relevant lines

Original Testbench Structure
[Figure: DUT (MAC) with Slave WB IFC, Master WB IFC, MII Interface, and Interrupts. Around it: PHY with Frame Source and Sink on the MII Interface; RAM (Slave WB IFC); SWEM Software Emulator (Master WB IFC); Frame Source/Sink. Legend: Bluespec, Verilog95.]

MAC Extended Example (as Router)
[Figure: a Wishbone Bus with Arbiter and Address Bank; three instances of the current Tb, each attached through a Frame Source/Sink, a WB Serializer, and an M/S WB IFC. Legend: Bluespec, Verilog95.]
BSV for SoC (System on a Chip) design
a.k.a.
ESL (Electronic System Level) design
Today’s chips: “SoC”s (System on a Chip)
“IP” blocks (“Intellectual Property”)
  Processors
  Caches, Memories
  Interconnects
  DMAs
  Other peripheral blocks
  I/O blocks
E.g., cell phones, cell network base stations, TV set-top boxes, iPods, digital cameras, …
[Figure: SoC block diagram. System Bus and Peripheral Bus with a Bus Bridge; blocks include Processor, Memory Controller, DMA Controller, DSP, Power Management, Arbitration, Application Specific, DRAM, SRAM, L2 Cache Ctlr, Serial Controller, Audio, Video, Flash/Mem I/F, Bus Controller.]
Design Issues
Complex tradeoffs in deciding architectures; need early HW architecture metrics:
Processor power, cache organization, bus and interconnect sizing, latencies, throughputs
Pipelined transactions, bursts, out-of-order processing
SW development needs to begin before HW is ready
Simulation speed (“boot the OS on the processor and run the video app thru the MPEG decoder HW IP block”)
Simulation speed inversely related to level of detail being simulated
TLM: Transaction Level Models
TLM is a level of abstraction well above the hardware implementation level, based on “transactions” at module interfaces
E.g., “send an Ethernet packet”, “read a disk sector”
Instead of:
  “send a byte/word”
  Wait for RDY, assert DATA_IN, assert ENABLE
Advantages:
  Models can be built quickly
  Capture essential functionality and essential structure
  Provide an environment for early development of embedded software
  Much faster simulation
Ideal: One consistent platform for system exploration & design
[Figure: grid with an architecture dimension and an abstraction/refinement dimension; Transaction Models are refined to Implementations for each candidate architecture.]
BSV: single-language methodology
BSV tools allow embedding C code (for embedded SW, early modelling)
Interface methods are naturally “transactional”
Interfaces can express complex interactions
Rules are naturally “reactive”
Types, parameterization, abstraction comparable to C++
Rules make it easier to express complex concurrency (due to atomicity)
HW metrics available from the beginning, for architecture decisions
Good HW synthesis exists
Single language environment, with strong semantics to enable disciplined refinement, testbench reuse, etc.
[Figure: Transaction Model in BSV (with embedded C), connected by refinement to HW Implementation in BSV.]
The importance of rapid architecture exploration
Can you estimate the hardware size of an IP block, just by looking at the spec?
Let’s look at what happened in three actual design activities:
  LPM (Longest Prefix Match) in Internet Packet Router
  MIPS processor 2-stage pipeline
  802.11a transmitter
A lookup table (sparse tree) for LPM

Prefix table:
  Prefix          Result
  7.14.*.*        A
  7.14.7.3        B
  10.18.200.*     C
  10.18.200.5     D
  5.*.*.*         E
  *               F

Example lookups:
  IP address      Result
  7.13.7.3        F
  10.18.201.5     F
  7.14.7.2        A
  5.13.7.2        E
  10.18.200.7     C

[Figure: three-level sparse lookup tree indexed by successive fields of the address. The per-lookup memory reference counts (column "M Ref") are not recoverable from the extraction.]

Real-world lookup algorithms are more complex but all make a sequence of dependent memory references.
Software version of LPM (in C)
int
lpm (IPADDRESS ipa)
{
    int p;

    p = RAM [ipa >> 16];                  // level 1 lookup (16b)
    if (isLeaf(p)) return p;

    p = RAM [p + ((ipa >> 8) & 0xFF)];    // level 2 lookup (8b)
    if (isLeaf(p)) return p;

    p = RAM [p + (ipa & 0xFF)];           // level 3 lookup (8b)
    return p;
}

Note: the C code says nothing about good microarchitectures for HW implementation
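The three-level lookup can be tried out with a small self-contained sketch. Everything beyond the shape of lpm() is an assumption made for illustration: the LEAF tag bit, the table layout in RAM, and the helpers new_table(), ip() and lpm_build() are hypothetical, and the table is filled with the prefixes A to F from the lookup-table slide. Note that & binds more loosely than + in C, so the byte extraction is parenthesized before it is added to the table base.

```c
#include <stdint.h>

#define LEAF    0x80000000u   /* assumed tag bit: entry is a leaf holding a result */
#define L1SIZE  65536u        /* level-1 table: one entry per 16-bit address prefix */
#define NTABLES 4u            /* 256-entry level-2/level-3 tables allocated below */

static uint32_t RAM[L1SIZE + NTABLES * 256u];
static uint32_t next_free;

static int isLeaf(uint32_t p) { return (p & LEAF) != 0u; }

/* Allocate a 256-entry table whose entries are all leaves with result `fill`. */
static uint32_t new_table(char fill)
{
    uint32_t base = next_free;
    for (uint32_t i = 0; i < 256u; i++)
        RAM[base + i] = LEAF | (uint32_t)fill;
    next_free += 256u;
    return base;
}

/* Pack four address bytes into one 32-bit IP address. */
uint32_t ip(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (a << 24) | (b << 16) | (c << 8) | d;
}

/* Fill RAM with the prefixes from the slide (A..F). */
void lpm_build(void)
{
    next_free = L1SIZE;
    for (uint32_t i = 0; i < L1SIZE; i++)
        RAM[i] = LEAF | (uint32_t)'F';              /* F: *           */
    for (uint32_t i = 0; i < 256u; i++)
        RAM[(5u << 8) + i] = LEAF | (uint32_t)'E';  /* E: 5.*.*.*     */

    uint32_t t1 = new_table('A');                   /* A: 7.14.*.*    */
    RAM[(7u << 8) | 14u] = t1;
    uint32_t t2 = new_table('A');
    RAM[t1 + 7u] = t2;
    RAM[t2 + 3u] = LEAF | (uint32_t)'B';            /* B: 7.14.7.3    */

    uint32_t t3 = new_table('F');                   /* 10.18.x falls back to F */
    RAM[(10u << 8) | 18u] = t3;
    uint32_t t4 = new_table('C');                   /* C: 10.18.200.* */
    RAM[t3 + 200u] = t4;
    RAM[t4 + 5u] = LEAF | (uint32_t)'D';            /* D: 10.18.200.5 */
}

/* Same structure as the slide's lpm(): up to three dependent memory reads. */
char lpm(uint32_t ipa)
{
    uint32_t p = RAM[ipa >> 16];                    /* level 1 lookup (16b) */
    if (isLeaf(p)) return (char)(p & 0xFFu);
    p = RAM[p + ((ipa >> 8) & 0xFFu)];              /* level 2 lookup (8b)  */
    if (isLeaf(p)) return (char)(p & 0xFFu);
    p = RAM[p + (ipa & 0xFFu)];                     /* level 3 lookup (8b)  */
    return (char)(p & 0xFFu);
}
```

After lpm_build(), lpm(ip(10, 18, 200, 7)) walks all three levels, while lpm(ip(5, 13, 7, 2)) stops after the first read; this data-dependent number of reads is what makes the hardware microarchitecture choice non-obvious.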
Longest Prefix Match for IP lookup
Static pipeline
  Inefficient memory usage but simple design
Linear pipeline
  Efficient memory usage through memory port replicator
Circular pipeline
  Efficient memory with most complex control
Designer’s Ranking: 1 2 3
Which is “best”?
Arvind, Nikhil, Rosenband & Dave ICCAD 2004
Even for such a small function, 3 dramatically different architectures (no doubt many more possibilities)
Synthesis results
Microarchitecture is by far the most significant determinant of HW quality
Even for an apparently “fixed” microarchitecture, clever microarchitecture optimization can have a dramatic effect
(Static V, I vs Static V, II)
LPM versions    Best Area (gates)     Best Speed (ns)
Static V, I     8898                  3.60
Static V, II    2271                  3.56
Static BSV      2391 (5% larger)      3.32 (7% faster)
Linear V        14759                 4.7
Linear BSV      15910 (8% larger)     4.7 (same)
Circular V      8103                  3.62
Circular BSV    8170 (1% larger)      3.67 (2% slower)

V = Verilog; BSV = Bluespec SystemVerilog, TSMC 0.18 µm
(In)applicability of “behavioral synthesis”
Traditional “behavioral synthesis” has a hard time with this example (just 10 lines of C!)
Hard to analyze variable number of memory reads that are data-dependent on each other
Hard to interleave them to access a single shared resource (memory)
Designer creativity needed to improve “Static V, I” into “Static V, II” (clever sharing of state machine)
Designer creativity needed to come up with circular pipeline
Design Activity 2
MIT postgraduate course: 6.884 Complex Digital Systems, Spring 2005 (see http://csg.csail.mit.edu/6.884/index.html)
Lab task: design and synthesize a simple MIPS 2-stage processor pipeline
Can there really be much variation in this?
The next slide shows the variation in HW quality across the different lab project teams
Lab 2 Results
Pareto-Optimal Points
Source: http://csg.csail.mit.edu/6.884/lab2-results.html
802.11a transmitter

802.11a: What’s the optimal implementation for power, area, performance?
802.11a Wi-Fi transmitter targeted at a wireless platform
Final design: 4 milliwatts
Source: Dave, Pellauer, Gerding & Arvind
[Figure: iteration loop between "Power Characterization" and "RTL for New Macro-/Micro-Architecture".]
[Figure: transmitter block diagram: Controller, Scrambler, Encoder, Interleaver, Mapper, IFFT, Cyclic Extend. Annotation on the diagram: "accounts for > 95% area".]
IFFT: Micro-architectural exploration
[Figure: IFFT dataflow with inputs in0 ... in63 and outputs out0 ... out63, three stages of 16 radix4 blocks each, and Permutation 0, Permutation 1 and Permutation 2 blocks. Inset: internals of one radix4 block with inputs x[0..3] and t[0..3], four multipliers, adders and subtractors, and a multiply by I.]
Sharing radix4’s?
Folding stages?
Each stage’s 16 radix4 blocks could also be implemented with 8, 4, 2 or 1 radix4 block(s) used over multiple cycles
Each stage is almost identical, why not fold and re-use what you can?
Each of the 48 radix4 blocks looks like this
Superfolded circular pipeline: Just one Radix-4 node!
[Figure: inputs in0 ... in63 and outputs out0 ... out63 around a single Radix 4 node, with Permute_1, Permute_2 and Permute_3 blocks, a Stage Counter (0 to 2), an Index Counter (0 to 15), 64 4-way muxes, 4 16-way muxes and 4 16-way demuxes.]
Designer intuition:
Most efficient design, lowest power
rule sync_pipeline (True);
   inQ.deq();
   sReg1 <= f1(inQ.first());
   sReg2 <= f2(sReg1);
   outQ.enq(f3(sReg2));
endrule

[Figure: inQ -> f1 -> sReg1 -> f2 -> sReg2 -> f3 -> outQ]

This is real IFFT code; just replace f1, f2 and f3 with stage_f code
Folded pipeline
[Figure: inQ feeds f; the output of f goes to sReg (fed back to f) or to outQ; a stage register selects f1, f2 or f3 inside f.]

rule folded_pipeline (True);
   if (stage==1) begin
      inQ.deq();
      sxIn = inQ.first();
   end
   else sxIn = sReg;
   sxOut = f(stage,sxIn);
   if (stage==3) outQ.enq(sxOut);
   else sReg <= sxOut;
   stage <= (stage==3)? 1 : stage+1;
endrule

function f (stage,sx);
   case (stage)
      1: return f1(sx);
      2: return f2(sx);
      3: return f3(sx);
   endcase
endfunction

This is real IFFT code too ...
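The behavior of the folded rule can be checked in software with a cycle-by-cycle model. The following is a minimal C sketch, not the IFFT code: f1, f2 and f3 are hypothetical arithmetic stand-ins for the stage functions, and the input queue is simplified to a single value that is consumed when stage is 1.

```c
#include <stdbool.h>

/* Hypothetical stage functions; stand-ins for the real IFFT stage code. */
static int f1(int x) { return x + 1; }
static int f2(int x) { return x * 2; }
static int f3(int x) { return x - 3; }

/* Same role as function f on the slide: pick the stage function. */
static int f(int stage, int sx)
{
    switch (stage) {
    case 1:  return f1(sx);
    case 2:  return f2(sx);
    default: return f3(sx);
    }
}

typedef struct {
    int stage;   /* 1, 2 or 3 */
    int sReg;    /* value held between stages */
} Folded;

/* One firing of the folded rule (one clock cycle).
   Returns true when a result was produced in *out. */
bool folded_step(Folded *s, int in, int *out)
{
    int  sxIn  = (s->stage == 1) ? in : s->sReg;
    int  sxOut = f(s->stage, sxIn);
    bool done  = (s->stage == 3);

    if (done) *out = sxOut;      /* corresponds to outQ.enq(sxOut) */
    else      s->sReg = sxOut;   /* corresponds to sReg <= sxOut   */
    s->stage = done ? 1 : s->stage + 1;
    return done;
}

/* Push one value through the folded pipeline and count the cycles used. */
int folded_run(int x, int *cycles)
{
    Folded s = { 1, 0 };
    int out = 0;
    bool done;

    *cycles = 0;
    do {
        done = folded_step(&s, x, &out);
        (*cycles)++;
    } while (!done);
    return out;
}

/* What the unfolded (synchronous) pipeline computes for one input. */
int unfolded(int x) { return f3(f2(f1(x))); }
```

The folded version computes the same function as the unfolded one but needs three cycles per input with one shared block of logic, which is the area versus throughput trade-off explored in the following slides.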
Performance results
7 combinations created and explored within 5 days
Designers were astounded to find that their intuitions were wrong and that the critical areas for reducing power were not where they suspected

802.11a Design          Area     Symbol Latency   Throughput      Min frequency     Average Power
(by IFFT block type)    (um^2)   (cycles)         (clks/symbol)   required (MHz)    (mW)
Combinational           4.91     10               4               1.0               3.99
Pipelined               5.25     12               4               1.0               4.92
Folded - 16 radix4      3.97     12               4               1.0               7.27
Folded - 8 radix4       3.69     15               6               1.5               10.9
Folded - 4 radix4       2.45     21               12              3.0               14.4
Folded - 2 radix4       1.84     33               24              6.0               21.1
Folded - 1 radix4       1.52     57               48              12.0              34.6

Original designer intuition: Folded - 1 radix4
Optimal power: Combinational
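The "Min frequency required" column is consistent with dividing the throughput in clocks per symbol by the symbol period. The sketch below assumes the 4 microsecond OFDM symbol period of 802.11a, which comes from the standard rather than from the slide; the function name is hypothetical.

```c
/* Assumption: 802.11a OFDM symbol period of 4 microseconds (from the
   standard, not stated on the slide). */
#define SYMBOL_PERIOD_US 4.0

/* Clocks needed per symbol divided by the time available per symbol gives
   the minimum clock frequency; clocks per microsecond is MHz. */
double min_frequency_mhz(int clks_per_symbol)
{
    return clks_per_symbol / SYMBOL_PERIOD_US;
}
```

For example, the 48 clocks per symbol of the single-radix4 design give 12.0 MHz, and the 4 clocks per symbol of the combinational design give 1.0 MHz, matching the table.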
BSV Advantages
It is essential to do architectural exploration for better (area, power, performance, ...) designs
Bluespec enables rapid architectural exploration
Fast, low-effort, low-risk changes enable:
  Rapid architectural/micro-architectural exploration and optimization
  Nimble responses to:
    Feature/spec changes
    Timing closure challenges
    Bug fixes
    Area optimizations
Architecture exploration: summary
Despite the self-image of many experienced engineers, there is a wide margin of error in estimating size of IP blocks without actually prototyping them (working out microarchitectures)
A bad estimate will leave you stuck with a sub-optimal design
So, Transaction Level Models, and their quick refinement to realistic hardware, are essential for accurate evaluation of candidate architectures
Essential to have a design language that supports this
  High levels of abstraction
  High levels of static checking and elaboration
  Synthesis from high level into quality hardware
BSV for architectural exploration
Rules and Interface Methods are “transactional” in nature
Can be written at very high level(in addition to the microarchitectural level)
E.g., module interconnection using highly parameterized Get/Put interfaces
  From complete packets to bits
  Similar to SystemC TLM
Clear semantics for splitting, joining, adding, removing
  Rich theory developed over many decades in Computer Science
  Enables disciplined refinement
Methods and Transaction Level Modeling
Each method can be read as a transaction that can be applied against a module
By just changing the level of abstraction of the arguments and results, we can move from realistic hardware to high-level models, using the single paradigm of methods
Get#(Bit#(16)) m <- mkM;
Put#(Bit#(16)) n <- mkN;

rule r1 (…);
   Bit#(16) x <- m.get();
   n.put (x);
endrule

Get#(EtherPacket) m <- mkM;
Put#(EtherPacket) n <- mkN;

rule r1 (…);
   EtherPacket x <- m.get();
   n.put (x);
endrule
Example:
A reference platform is created to test a device driver for a hard disk microdrive. The reference platform allows either the device driver or the hardware model to be swapped out for the actual implementation. The model is instrumented with Assertions for monitoring transactions.
Creating a Reference Platform
[Figure: System containing RoS, DD, HW and Monitor blocks.]
RoS (Rest of System): periodically initiates a disk sector R/W transfer, and continues concurrent activity (non-blocking)
DD (Device Driver): converts sector transfer requests into IDE protocol consisting of IDE register R/Ws, and responding to IDE HW interrupts
HW (IDE disk): models IDE register command block and sector data buffer, and behavior in response to IDE commands written into IDE command register
Monitor: monitors all inter-block traffic, checks for immediate and temporal correctness conditions
Connections shown: Disk sector R/W requests, Callbacks, IDE register R/Ws, Interrupts, Data for PIO reg reads
Handle callbacks asynchronously
Using the reference platform, replacing DD with real C code
[Figure: System Cosim with RoS (Rest of System), DD (Device Driver), HW (IDE disk) and Monitor.]
In Bluesim: all written in BSV (same reference model code)
In SystemC simulator: all written in C/C++/SystemC
DD now written in C, alongside other C/C++/SystemC code
Interface communication shim code automatically generated by Bluespec compiler
Communications are simple function calls
Understanding IDE: 2 weeks
Coding and Verification: 5 days
Integrating “C” Driver: 3 days
Example: Amba AHB bus system, from transactional level to implementation level

E.g., Amba AHB bus: transactional level (get/put)
[Figure: Master Blocks and Slave Blocks with master and slave transactional interfaces, bus master-side and bus slave-side transactional interfaces, and a direct transactional interconnect (for faster simulation).]

E.g., Amba AHB bus: mixed transactional/implementation levels
[Figure: AHB with Master Blocks and Slave Blocks; some blocks connect through master/slave interfaces and bus master-side/slave-side interfaces, others through transactional interfaces with adapters.]

Amba AHB bus: Implementation level
[Figure: AHB with Master Blocks and Slave Blocks connected through master/slave interfaces and bus master-side/slave-side interfaces.]
Many proof points demonstrating
- General applicability
- Productivity
- HW quality

Designs with Bluespec
[Figure: SoC block diagram (System Bus, Peripheral Bus, Bus Bridge, Processor, Memory Controller, DMA Controller, DSP and other blocks) annotated with three application categories: Algorithms (e.g. DSP/math), Complex Datapaths (e.g. processor/controller), Control.]
Bluespec has been used for every design listed:
  “RISC” processor, MIPS, Itanium, PowerPC, ARM
  L2 cache ctlr
  Bus converters
  AMBA DMA ctlr
  802.11a, Network proc, Queuing engines, Sorting queue, Arbiter, IP lookup, Debug controller
  PCI Express, USB, I2C, PCI-X
  OCP interconnect
  Pixel processor, Waveform generator, Pong
  IDCT, Motion compensator, DES, MPEG-4, IFFT
  DDR2 ctlr, SRAM ctlr
  FIR filter
Bluespec is the only next generation solution that addresses control and complex datapaths
Everyone else only addresses this application space [annotation pointing at part of the figure]
Bluespec vs. Hand-coded RTL
[Figure: histogram "Bluespec vs. RTL (Area Optimized)". x axis: Bluespec Area Relative to Hand-Designed RTL, in bins from 10%+ (smaller) to -10%- (larger); y axis: Number of Test Cases. Annotations: Better, 7 Designs; Same, 18 Designs.]
[Figure: histogram "Bluespec vs. RTL (Speed Optimized)". x axis: Bluespec Speed Relative to Hand-Designed RTL, in bins from 10%+ (faster) to -10%- (slower); y axis: Number of Test Cases. Annotations: 5 Designs, 20 Designs.]

                                   Area Optimized               Speed Optimized
    Test Case                      Hand-coded    Bluespec       Hand-coded    Bluespec
                                   RTL (area)    RTL (area)     RTL (time)    RTL (time)
 1  Gray code converter            9             9              1.56          1.56
 2  Priority encoder               21            21             2             2
 3  Parity checker                 23            23             3.94          3.94
 4  Read/write FSM                 20            20             3.93          3.93
 5  Barrel shifter                 34            34             1.65          1.65
 6  Speed FSM                      33            33             4.86          4.75
 7  Ripple adder                   86            86             10.98         10.98
 8  Angular FSM                    66            63             4.97          4.99
 9  One-hot encoded FSM            99            116            5.76          5.97
10  Pattern detector               67            67             5.98          5.74
11  Wallace multiplier             142           141            9.6           9.91
12  Handshake protocol             109           112            5.99          5.98
13  Traffic light controller       215           211            9.93          10
14  Rotors controller              259           249            8.93          9
15  Sequential multiplier          340           361            8.93          8.93
16  Shift adder                    391           399            10            9.85
17  Three-way roundrobin           399           385            9.73          9.06
18  Divider                        350           347            27.65         27.65
19  Cache coherence                352           382            10.36         10.5
20  Booth multiplier               974           822            14.91         14.92
21  Fibonacci                      914           877            14.45         13.9
22  LIFO                           1764          1850           7.99          7.99
23  FIFO1                          1926          2018           14.98         14.2
24  Factorial                      1611          1605           34.97         34.95
25  Random number generator        9278          8947           79.51         79.88
    Totals                         19482         19178          313.56        312.23
IDCT design results
                                               Verilog          Bluespec SystemVerilog
RTL coding & unit verification                 2.5 man-weeks    1.3 man-weeks
Top level verification                         1.5 man-weeks    1.2 man-weeks
Total effort                                   4 man-weeks      2.5 man-weeks
Lines of code                                  2716             723
Latency (IO) in clock cycles                   172              171
Gate count (2-input NAND; excluding memory)    52K              48K
Itanium: IA64 in Bluespec
Wunderlich & Hoe

Platform Capabilities
  High speed execution of the Bluespec model, runs at 100 MHz, 4 orders of magnitude faster than ModelSim
  Full access to the FSB, allowing 800 MB/s cache line reads and writes, plus a control channel to the Pentium III processor via mapped I/O
  Large FPGA resources, the current design occupies less than 30% of the FPGA resources

IPF Microarchitecture Model
[Figure: pipeline with Fetch, Decode and Disperse stages feeding Memory, Branch and Integer×3 units (Stack, Read, Execute, Memory and Write stages), together with Pipe. Control, Instr. Cache, Data Cache, Unified L2, FSB Control, Branch Pred., Register Set and Bypass.]

The first model was developed in a few months by one student!
… and numerous other examples
Validated by customer experience
50% less time (or better) to verified, synthesized design
Even with no prior knowledge of BSV
Area and time of synthesized design matched previous implementations done in Verilog/VHDL
Up to multi-million gate designs
Tools and tool flow
Tools and flow
[Figure: tool flow. Elements: Bluespec SystemVerilog source, Bluespec Synthesis, Verilog 95 RTL (plus other Verilog/VHDL), Verilog sim, Bluesim (Cycle Accurate), VCD output, Visualization (e.g., Debussy), Blueview, RTL synthesis, gates. Legend: files, Bluespec tools, 3rd party tools.]
Interactive Cross-Probing between Views (source, RTL, Novas Debussy/Verdi)
[Figure: screenshots of the SOURCE, RTL and Waves views.]
Concurrency Semantics of Rules and Rule-based Interface Methods are also available in SystemC
Why integrate SystemC with Rules and Rule-based Interface Methods?
Improve SystemC’s concurrency model
  Atomic transactions vs. threads and events
  Rule semantics across module boundaries
Provide a path to high-level synthesis for control logic and complex datapaths
Enable use of same model for embedded software development and hardware exploration and hardware implementation
[Figure: SystemC flow with Rules. Core SystemC, +TLM and +Rules layers, each with class defs/libs, run on standard SystemC tools (gcc, OSCI sim, gdb, …). Refinement leads to the Bluespec synthesizable subset, which feeds the Bluespec synthesis tool, Bluesim and other Bluespec tools; the resulting RTL goes through standard synthesis back-end tools to HW.]
Components
Additional classes and macros (esl.h)
  Defines Bluespec Modules, Rules, Methods, Interfaces, etc
ESL Analyzer (“esepp”)
  Parses Modules, Rules, Methods
  Generates code to call elaborator with callback registrations, etc.
  Generated code is compiled and linked with the rest of the system
  Cannot be done with cpp
  Original modules are not changed by the analyzer and can be compiled directly by gcc, but must be linked with ESEPP-generated code
Run-time system (libesepro.a)
  Elaborator
    Determines priorities and scheduling ordering of rules and methods, executed
  Run-time scheduler
    “Fires” rules on every clock cycle
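To make the idea of a run-time scheduler that "fires" guarded rules concrete, here is a minimal C sketch. It is not the esepro run-time system and does not use esl.h: the Rule struct, the GCD example and all function names are hypothetical. It also simplifies the scheduling to at most one rule per clock cycle, chosen by a fixed priority order, whereas a real rule scheduler can fire several non-conflicting rules in the same cycle.

```c
#include <stdbool.h>
#include <stddef.h>

/* State shared by the rules (hypothetical example: Euclid's GCD). */
typedef struct { unsigned x, y; } State;

/* A rule is a guard (may it fire?) plus an atomic action. */
typedef struct {
    bool (*guard)(const State *);
    void (*action)(State *);
} Rule;

static bool swap_guard(const State *s) { return s->x > s->y && s->y != 0; }
static void swap_action(State *s) { unsigned t = s->x; s->x = s->y; s->y = t; }

static bool sub_guard(const State *s) { return s->x <= s->y && s->y != 0; }
static void sub_action(State *s) { s->y -= s->x; }

/* Rules listed in priority order. */
static const Rule rules[] = {
    { swap_guard, swap_action },
    { sub_guard,  sub_action  },
};

/* One clock cycle: fire the highest-priority rule whose guard holds.
   Returns false when no rule is enabled. */
bool clock_cycle(State *s)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (rules[i].guard(s)) {
            rules[i].action(s);
            return true;
        }
    }
    return false;
}

/* Run until no rule can fire; assumes a > 0. The cycle bound only guards
   against misuse (for example a == 0). */
unsigned gcd_by_rules(unsigned a, unsigned b)
{
    State s = { a, b };
    for (unsigned cycle = 0; cycle < 1000000u && clock_cycle(&s); cycle++) {
        /* each iteration models one clock cycle */
    }
    return s.x;
}
```

Because each action runs to completion before the next guard is evaluated, every rule sees a consistent state, which is the atomicity property the slides contrast with threads and events.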
Components/flow
[Figure: build flow. dut.epp is processed by esepp into dut.cpp; dut.cpp #includes systemc.h and esl.h and is compiled by gcc, then linked with libsystemc.a and esepro.a into the simulation executable. Legend: Standard SystemC flow, Rule classes.]
Future: formal verification
Verification: Formal Methods — why?
So far, Verification = Testing (by simulation)
  Even the current use of assertions (PSL, SVA) is only a testing strategy
Unfortunately, the size (# state elements) of today’s chips makes it increasingly difficult/impossible to cover the state space by testing
Verification: Formal Methods
Approach 1
Use theorem-proving and other methods to prove assertions (rather than just testing assertions during simulation)
Assertions can be written using PSL, SVA, …
Advantage: coverage (assertion is always true, not just for a particular set of test cases)
Caveat: verification can only be as good as the set of assertions being verified!
Does the set of assertions completely specify the design?
Verification: Formal Methods
Approach 2
Prove the equivalence of a simple reference model with the implementation
E.g., for a processor design:
  Reference model: one instruction at a time, no pipelining, no speculation, no caching
  Implementation: full implementation details
Proof method:
  Define a correspondence between each state in the reference model and a state in the implementation
  For each state change in reference model, show that the implementation moves between corresponding states
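The proof method can be illustrated on a toy example. The C sketch below is an assumption-laden stand-in, not a processor proof: the reference model is a single counter modulo 6, the "implementation" is a two-digit counter, and the correspondence is a hypothetical abstraction function from implementation states to reference states. Because the state space is tiny, the check simply enumerates every implementation state instead of using a theorem prover.

```c
#include <stdbool.h>

/* Reference model: one counter that counts 0, 1, ..., 5, 0, ... */
static int ref_step(int r) { return (r + 1) % 6; }

/* "Implementation": two digits, hi in 0..1 and lo in 0..2, with carry. */
typedef struct { int hi; int lo; } Impl;

static Impl impl_step(Impl s)
{
    if (s.lo == 2) {
        s.lo = 0;
        s.hi = (s.hi + 1) % 2;   /* carry into the high digit */
    } else {
        s.lo++;
    }
    return s;
}

/* Correspondence: which reference state an implementation state represents. */
static int abstraction(Impl s) { return s.hi * 3 + s.lo; }

/* For every implementation state, stepping the implementation and then
   abstracting must equal abstracting and then stepping the reference model. */
bool refinement_holds(void)
{
    for (int hi = 0; hi < 2; hi++) {
        for (int lo = 0; lo < 3; lo++) {
            Impl s = { hi, lo };
            if (abstraction(impl_step(s)) != ref_step(abstraction(s)))
                return false;
        }
    }
    return true;
}
```

For a real design the state space is far too large to enumerate, which is why the slides point to theorem proving and to languages with strong formal semantics.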
Verification: Formal Methods
Some references on formal verification using Rule semantics
Parallel Program Design: A Foundation, K. Mani Chandy and Jayadev Misra, Addison Wesley, 1988
  UNITY programming language for concurrent, reactive systems
Using Term Rewriting Systems to Design and Verify Processors, Arvind and Xiaowei Shen, IEEE Micro 19:3, 1999, p36-46
Cache Coherence Verification with TLA+, H. Akhiani, D. Doligez, P. Harter, L. Lamport, J. Scheid, M. Tuttle and Y. Yu, Proc. World Congress on Formal Methods in the Development of Computing Systems-Volume II, p.1871-1872, September 20-24, 1999
Proofs of Correctness of Cache-Coherence Protocols, Stoy et al, in Formal Methods for Increasing Software Productivity, Berlin, Germany, 2001, Springer-Verlag LNCS 2021
Superscalar Processors via Automatic Microarchitecture Transformation, Mieszko Lis, Masters thesis, Dept. of Electrical Eng. and Computer Science, MIT, 2000
Verification: Formal Methods
Summary
Formal methods in verification are not yet in widespread use
Many companies have started using formal methods on an experimental basis
These methods will become increasingly important as chip complexity increases
Design languages with strong formal semantics will improve the likelihood of success
Summary and wrapup
Bluespec SystemVerilog™
A one slide overview
Two dimensions raising the level of abstraction (fully synthesizable)
Behavioral: Rules and Interface Methods
  For complex concurrency and control, across multiple shared resources, across module boundaries
Structural:
  High-level abstract types
  Powerful static checking
  Powerful parameterization
  Powerful static elaboration
  Advanced clock management
[Figure: Structural and Behavioral axes contrasting VHDL/Verilog/SystemVerilog/SystemC with Bluespec SystemVerilog.]
Summary
Bluespec is using ideas from advanced programming languages:
Behavior:
  Rule-based systems, atomic transactions, correctness using invariants, modularity, achieving performance systematically via Rule-composition semantics, …
Structural correctness, abstraction and elaboration
  Complex types, abstract types, polymorphism (type parameterization), systematic overloading
  Orthogonality (parameterization over all semantically meaningful concepts, including pieces of behavior)
  Full programming power for structural descriptions
... to tackle the complexities of modern chip design
  Both individual HW blocks, and SoCs
Bluespec: Better Design Accelerates Everything!
Fully synthesizable – without compromise!
[Figure: design flow stages Architecture, Design, Verification and Test, Physical Design, annotated with the benefits listed below.]
  Architectural exploration
  Early executable models
  More architectural flexibility during design
  Better reuse
  50% reduction in errors, faster correction
  50% reduction from design to verified netlist
  Faster fixes, to achieve closure
End
Thank you for your attention!