reconfigurable computing - university of sydney computing finite state machines school of electrical...
TRANSCRIPT
Reconfigurable Computing Finite State Machines
School of Electrical and Information Engineering
http://www.ee.usyd.edu.au/~phwl
Philip Leong ([email protected])
Tortise: You don't need to give an infinite number of monkeys an infinite amount of time to write Hamlet. A finite number of monkeys and a finite amount of time will do just fine. And like my father used to say, never use an infinite number of monkeys when a finite number will do.
- Mike Schiraldi, in a sci.math posting
Lecture Recordings
http://www.ee.usyd.edu.au/people/philip.leong/hit-rc14.html Slides often change just prior to lecture!
Video recording of lecture
Introduction
› Most practical digital designs built from datapath+control - control implemented as a finite state machine
› Describe how to implement finite state machines - how to construct a FSM
- how to optimize for time and area
3
Simple Example: A memory controller
4
State Transition Table
5
Current State Input Condition
Next State Output
Idle ready Decision OE=0,WE=0 Decision ~rw
rw Write Read
OE=0,WE=0
Write ready Idle OE=0,WE=1 Read ready Idle OE=1,WE=0
• Note that output is only a function of the state (Moore machine) • If input condition not met, state does not change
Next State Logic Output LogicState Registersinputs next_state current_state outputs
State Assignment
6
Current State Input Condition
Next State Output
Idle ready Decision OE=0,WE=0 Decision ~rw
rw Write Read
OE=0,WE=0
Write ready Idle OE=0,WE=1 Read ready Idle OE=1,WE=0
(s1,s0) (n1,n0) Output 00 01 OE=0,WE=0 01 10
11 OE=0,WE=0
10 00 OE=0,WE=1 11 00 OE=1,WE=0
• n1 = ~s1.~s0.ready • n0 = ~s0.~s0.ready+ ~s1.s0.rw • OE = s1.s0 • WE = s1.~s0
Implementation
7
• n1 = ~s1.~s0.ready • n0 = ~s0.~s0.ready + ~s1.s0.rw • OE = s1.s0 • WE = s1.~s0
Speed
8
Next State Logic Output LogicState Registersinputs next_state current_state outputs
› Clock period constraint due to setup time - tclk ≥ tlogic + tsu+ tcq
- where tlogic is the next state logic delay
- tsu is the register setup time (time before the clock edge that data must be stable)
- tcq is the clock to output delay (time after clock edge q might be unstable)
› Clock period constraint due to hold time - tH ≤ tlogic + tcq ≤ tclk
- This is most extreme in the case of a shift register
tcq tco tSU
Optimization for speed/area
› Often there is no need - FPGA usually fast enough so almost any description is acceptable
- in this case use the most straightforward implementation (easiest to write/read/debug)
- otherwise giving timing constraints often does the job
› Optimization - FSM dependent
- most efficient FSM depends on # states, complexity of logic etc
- architecture dependent - different for FPGA/CPLD/ASIC
- speed max freq?, smallest clock to output delay? - interrelated but we can optimize one in favor of another
9
Other FSM architectures
› We will look at 4 different implementations - outputs decoded from state bits combinatorially
- output decoded in parallel output registers
- outputs encoded within state bits
- one hot encoding
10
Outputs decoded from state bits combinatorially
› This is the technique described earlier
› Output logic is a combinatorial function of the state - suffers from an additional delay from output of state bits to the output signals
Next State Logic Output LogicState Registersinputs next_state current_state outputs
11
Glitching
› Consider a state change from 1111 -> 0000 - The 4 bits could change at different speeds e.g. 1111 -> 1011 -> 0011 -> 0000
- Output logic could produce a number of spurious values during current_state transient
- Wastes power and could affect correct operation
12
Next State Logic Output LogicState Registersinputs next_state current_state outputs
Output decoded in parallel output registers
› Decode the outputs before the state bits are registered i.e. at the output of the nextstate
› Outputs are generated earlier than the previous version this is because the output combinatorial logic delay is removed
› Area larger because we have additional registers for the output › Speed we have two combinatorial delays to the output so maximum speed
may be affected › Avoids glitching of outputs
Next State Logic State Registersinputs next_state current_state
outputsOutput Logic Output Registers
13
Outputs encoded within state bits
› An example is a counter - the output is also the state
› Find a state encoding which directly generates the output - design more difficult
- we need to find such an encoding
- code more complicated and difficult to read
- can resolve clashes by adding outputs to differentiate similar bits
- e.g. for the memory controller can add ST0 to differentiate the IDLE and DECISION states
› Efficiency depends on the particular circuit - may be more/less efficient in area/speed, best way is to try it for a critical design
Next State Logic State Registersinputs next_state
current_state
outputs
14
Gray code
› Gray code is arranged so that only one output signal changes (no glitching)
› Can code FSMs so that the state transitions follow Gray codes
› Conversion simple g[i] = b[i] xor b[i+1] (b[4] = 0)
15
One hot encoding
› Use n flip-flops to represent an n-state FSM › Significantly reduces the logic for outputs and nextstate
- why?
- also reduces logic to generate outputs
- at the expense of more registers
› FPGAs - rich in registers
- for max speed and small to medium FSMs often the best choice
16
Mealy machines
› All FSMs we have discussed are Moore machines since outputs are functions only of the current state - for Mealy FSMs, outputs functions of current state and inputs
- output can usually change 1 cycle earlier than Moore
- can have fewer states than Moore
Next State Logic Output LogicState Registersinputs current_state outputsnext_state
17
Fault tolerance
› What happens if for whatever reason the FSM gets into an undefined state - fault tolerant designs can recover
- alternatively could enter an error state that somehow reports
- explicit “don’t care”
18
Fault tolerance for one-hot designs
› There are many more possible illegal states for 1-hot designs - could detect with logic
- this is quite expensive, affects performance and area
- could be pipelined
19
BlockRAM Applications, State Machine
Store next address in the ROM
Conditional jump info is being entered as
additional address inputs
One RAM can be split in two: e.g. One 18K dual-port = two 9K BlockRAMs
with independent everything:
( R/W, clock, address, data content, aspect ratio )
High speed, independent of complexity
Slide courtesy of Peter Alfke 20
Fast State Machine in one BlockRAM
256 states, 4-way branch (or 128 states, 8-way) 36 optional parallel outputs from the second port
High speed operation, independent of code
8 bits 8 + 1 bitsBranchControl 2 bits
Output
BlockROM
1K x 9
256 x 368 bits 36 bits
Slide courtesy of Peter Alfke 21
Metastability
› Recall all registers have setup (tSU) and hold (tH) times. Violating these can lead to metastability in which the specified clock-to-output delay (tCO) is violated.
22 Source: Altera
Mechanism
23
› The mechanism is analagous to a ball dropped on a hill. If setup and hold times are met, the ball is dropped near a stable state.
› Due to noise, it is eventually resolved but this takes an indeterminate amount of time.
Source: Altera
Modeling Metastability
24
› The mean time between failure can be modeled as
- where C1 and C2 are process constants
- fCLK, fDATA are clock frequency of receiving signal and toggle rate of asynchronous input
- tMET is timing slack available beyond the register’s tCO, for a potentially metastable signal to resolve to a known value
- tMET/C2 has largest effect and increases in transistor speed improve metastability MTBF
Synchronization Chains
25
› Metastability can be virtually eliminated (with an increase in latency) via synchronization chains. tMET is sum of the output timing slacks for each register in the chain - e.g. if tMET is 50 ps, increasing to 1 ns improves MTBF by e20=5×108 times
› Signals on different clock domains can be handled with a dual-clock FIFO which contains synchronizers
› Never connect asynchronous input to multiple flip-flops as they can resolve at different speeds
Clock skew
› The clock signal can arrive at different times, this is called clock skew
› Clock subject to variation, this is called clock jitter
› While quite well controlled, skew and jitter reduce maximum clock frequency - tclk ≥ tlogic + tsu+ tcq + tskew + tjitter
26
Conclusions
› Studied many different ways to implement FSMs
› Mostly we use - outputs decoded combinatorially from state registers
- one-hot (particularly good for high speed small-medium FSMs on FPGAs)
- RAM/ROM
› Consider metastability when dealing with asynchronous inputs
27
References
› An excellent tutorial on HDL, particularly the “Technology Independent Coding Style” chapter http://www.actel.com/documents/hdlcode_ug.pdf
› Altera White Paper, Understanding Metastability in FPGAs http://www.altera.com/literature/wp/wp-01082-quartus-ii-metastability.pdf
28
Reconfigurable Computing Shift Registers and Random Number Generators
School of Electrical and Information Engineering
http://www.ee.usyd.edu.au/~phwl
Philip Leong ([email protected])
Introduction
› Efficient FPGA design is often nonobvious - with practice, you can learn when to use the special features
› This lecture - Shift registers - Linear feedback shift register (LFSR) - Using RAM for shift registers - True random number generator
Shift register
› Basic building block in digital circuit design
› Features - high speed due to simple logic & interconnection
- low interconnection costs
Shift registers
› Applications - delay line
- serial-parallel conversion
- parallel-serial conversion
- boundary scan
- serial buses
- bit serial signal processing (more about this in future lectures)
Straightforward Implementation
module shift_1x64 (clk, shift, sr_in, sr_out,); input clk, shift; input sr_in; output sr_out; reg [63:0] sr; always@(posedge clk) begin if (shift == 1'b1) begin sr[63:1] <= sr[62:0]; sr[0] <= sr_in; end end assign sr_out = sr[63]; endmodule How many LEs for an N bit SR?
Straightforward Implementation
VHDL
if (clk’event and clk = ‘1’) then
sr <= srin & sr(sr’high downto (sr’low+1));
end if;
How many LEs for an N bit SR?
VERILOG module shift_1x64 (clk, shift, sr_in, sr_out,); input clk, shift; input sr_in; output sr_out; reg [63:0] sr; always@(posedge clk) begin if (shift == 1'b1) begin sr[63:1] <= sr[62:0]; sr[0] <= sr_in; end end assign sr_out = sr[63]; endmodule
RAM
› Remember - Xilinx LUT as a memory has better density than a LUT
- Shift registers only need serial access to the memories
- i.e. do not need all of the outputs in parallel, only need the first input and the last output
Shift register using RAM (4-LUT example)
Address generation: counters
› Ripple carry counter
› Synchronous counter
› Gray code counter
› LFSR
› What are the advantages and disadvantages of each?
Counters
Gray code
LFSR
› Linear feedback shift register - XNOR of certain outputs are fed back to the input
- it is maximum-length if it has a period of 2^N-1
- lets look at a length 4 maximum length LFSR
4 bit Maximal Length LFSR
4 bit LFSR
› How does it count?
4 bit LFSR
› 0, 1, 3, 7, E, D, B, 6, C, 9, 2, 5, A, 4, 8, 0
› period is 2^4-1 = 15
› can be used as pseudorandom number generators
› note it only has a 2 input XOR gate as the critical path! - Compare with an adder’s carry chain
Larger maximal length LFSRs (see XAPP052 for more)
› N=4 (4,3)
› N=8 (8,6,5,4)
› N=16 (16,15,13,4)
› N=32 (32,22,2,1)
› N=64 (64,63,61,60)
› N=128 (128,126,101,99)
› N=168 (168,166,153,151)
Primitive polynomials
› A polynomial P of degree m over a finite field Fp is primitive, iff P is monic, P(0)≠0 and ord(P(x))=pm-1. - Monic means the coefficient of the term with the highest power is 1 - If P(0)≠0, ord(P) is smallest integer e for which P(x) divides xe-1 - E.g. P(x)=x4+x3+1 is primitive
› The feedback values for an XOR based maximum length LFSR correspond to the primitive polynomial
› Programs which compute primitive polynomials - Web based but only supports small n
http://wims.unice.fr/wims/wims.cgi?session=VX756ADCFA.2&+lang=en&+module=tool%2Falgebra%2Fprimpoly.en
LFSR counters
› Maximal LFSR period is 2^N-1 - sometimes we want the period < 2^N-1
- Arbitrary period counter with shorter period can be made by adding a synchronous reset circuit
Virtex SRL
› Virtex has a macro called “Shift Register Lookup Table” (SRL) for implementing SRs
LUT as a mux
LUT as an addressable SR
• Length of SR can be controlled using the D output
Virtex 5
› Virtex 5-7 series slice has 4x 6-input LUTs, each LUT can be a 64x1 or 32x2 RAM
› SRL16x2 and SRL32 primitives are available per LUT (in SLICEM)
SRLC32E
True Random Number Generator
› Introduction
› Ring Oscillator based noise source
› Alternating Sequence Generator
› Results
› Conclusion
Introduction
› Random number generators (RNGs) are in widespread use, main applications are in - Simulation
- Cryptography
› Propose random number generators for these applications - Physical RNG (PRNG) based on oscillator phase noise
What is a random number generator?
› A cryptographically secure random bit generator (CSRBG) is one which produces sequences for which there is no polynomial time algorithm which, on input of the first l bits of the output sequences, can predict the (l+1)st bit of s with a probability significantly greater than 0.5
Applications
› Designed for embedded FPGA applications - Small
- Fast
› For cryptographic applications - Secure
Physical random number generators
› Oscillator sampling
› Direct amplification of transistor/resistor noise
› Discrete time chaos
Oscillator phase noise source
• One method to produce a real random number source in an FPGA is to sample a high frequency signal with a low quality low frequency clock
• If the phase noise of the low frequency oscillator is of the same order as the period of the high frequency clock, output is quite random
Ring Oscillator based noise source
Increased Randomness via Clock Doubling
Poker Test (ONS only)
› Checks that each m-bit string occurs the correct number of times › Passed if the result is between 1.03 and 57.4
From Menezes, “Handbook of Applied Cryptography”
Poker Test (ONS only)
PRNG
› At this point, have a mediocre physical RNG that doesn’t pass Poker test › Can turn into a good RNG by reducing sampling clock rate and passing through
a parity filter to deskew non-uniform distribution (Tsoi, Leung and Leong, FCCM03) - slow
› How can we make a fast one?
Alternating Step Generator
• Selected among many stream ciphers because of its compact implementation
• Recent research has shown weaknesses in this cipher (but any other can be used)
Modified Alternating Step Generator
LFSRs
› 127 and 129 bit LFSRs based on primitive polynomials with approx equal 1 and 0 coefficients - LFSR1(x) = x127 + x102 + x61 + x59 + x58 + x57 + x56 + x55 + x54 + x53 +
x52 + x51 + x50 + x49 +x48 + x47 + x46 + x45 + x44 + x43 + x42 + x41 + x40 + x39 + x38 + x37 + x36 + x35 +x34 + x33 + x32 + x31 + x30 + x29 + x28 + x27 + x26 + x25 + x24 + x23 + x22 + x21 +x20 + x19 + x18 + x17 + x16 + x15 + x14 + x13 + x12 + x11 + x10 + x9 + x8 + x7 +x6 + x5 + x4 + x3 + x2 + x + 1
- LFSR2(x) = x129 + x119 + x62 + x61 + x60 + x59 + x58 + x57 + x56 + x55 + x54 + x53 + x52 + x51 +x50 + x49 + x48 + x47 + x46 + x45 + x44 + x43 + x42 + x41 + x40 + x39 + x38 + x37 +x36 + x35 + x34 + x33 + x32 + x31 + x30 + x29 + x28 + x27 + x26 + x25 + x24 + x23 +x22 + x21 + x20 + x19 + x18 + x17 + x16 + x15 + x14 + x13 + x12 + x11 + x10 + x9+x8 + x7 + x6 + x5 + x4 + x3 + x2 + x + 1
Implementation
› Results for Virtex XCV300E-8 (including computer interface) - Period 7.482ns, 129 slices (4%), 4 BRAM (12%) - High frequency clock can be > 400 MHz (133 MHz used) - Ring oscillator freq > 800 MHz
› Reported results are for no clock doubler but doubled version passes tests as well
› Passes Crush, NIST and Diehard tests
NIST and Diehard Tests
TRNG Summary
› Combines weak RNG with high speed stream cipher to produce a physical noise source
› Clock doubler increases oscillator phase noise › Small area, high output rate, good statistical properties › ASG makes a very low cost implementation, other stream ciphers can be used
Generalized LFSRs
› A variation useful for cryptographic applications - Feedback taps are programmable (can be updated after synthesis)
- Polynomial has high weight so it has many feedback taps requiring high fan-in XNOR gate
› Question: how do we design such a circuit?
Block diagram
Q
QSET
CLR
D
Q
QSET
CLR
D
Q
QSET
CLR
D
Q
QSET
CLR
D
Q
QSET
CLR
D
...
PROGRAMMABLE LUT implementing required XNOR
CLK
CE
DIN
Q1 Q3Q2 Q15
F(Q1,Q2,…,Q15)
Implementing the LUT
LUT 1 LUT 2 LUT 3 LUT 4
LUT 1
Q1 Q4Q3Q2 Q13 Q16Q15Q14Q9 Q12Q11Q10Q5 Q8Q7Q6
F(Q1,Q2,…,Q16)
• SR used to program the LUTs to implement the required feedback arrangement WITHOUT CHANGING THE BITSTREAM
• Can be reduced from 2 levels of logic to 1 by pipelining
Generalized LFSR Floorplan
• Serpentine arrangement reduces wire lengths
LFSR Cell LFSR CellLFSR CellLFSR Cell
LFSR Cell LFSR CellLFSR CellLFSR Cell
LFSR Cell LFSR CellLFSR CellLFSR Cell
LFSR Cell LFSR CellLFSR CellLFSR Cell
Stage 1 pipeline
LFSR cell
Q
QSET
CLR
DSR input SR output
Q
QSET
CLR
D
From SouthConnections
x3
u1x2
x1
f(x1...xn)
To left handtop cell
Leftmost cell hasone input from
each of theflip-flops on the
first row
x3
x4
u1 x2
x1
f(x1...xn)Row 1 Leftmost cell only
Row 1 Non-leftmost cell only
All cells• RAMs used to customize LFSR polynomial (can be loaded at any time)
• Leftmost stage has no pipelining
Pipelining Logic
Q
QSET
CLR
D
Q
QSET
CLR
D
Q
QSET
CLR
D
Q
QSET
CLR
D
• Must compensate for the extra latency introduced by pipelining
Conclusions
› If we can use Xilinx distributed RAM it gives higher better density - can be used for shift registers, counters & PRNG
- Virtex has a SRL macro for even better efficiency
› LFSR - generate random numbers - high speed nonbinary counters and dividers - Can construct rrue random number generator using oscillator phase noise and the ASG
References
› XAPP052 “Efficient Shift Registers, LFSR Counters and Long PseudoRandom Sequence Generators” and XAPP210 “LFSRs in Virtex Devices” (http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf)
› K.H. Tsoi, Ka Ho Leung, and Philip H.W. Leong. A high performance physical random number generator. IEE Proc. Computers & Digital Techniques, 1(4):349–352, July 2007. (http://www.cse.cuhk.edu.hk/~phwl/mt/public/archives/papers/prng_cdt07.pdf)
Review Questions
› Make sure you understand how the SRL is implemented in an FPGA
› Make sure you can implement an LFSR in VHDL/Verilog
› When is an LFSR more efficient than a counter for counting applications?