reconfigurable computing - university of sydney computing finite state machines school of electrical...

Reconfigurable Computing Finite State Machines

School of Electrical and Information Engineering

http://www.ee.usyd.edu.au/~phwl

Philip Leong ([email protected])

Tortise: You don't need to give an infinite number of monkeys an infinite amount of time to write Hamlet. A finite number of monkeys and a finite amount of time will do just fine. And like my father used to say, never use an infinite number of monkeys when a finite number will do.

- Mike Schiraldi, in a sci.math posting

Lecture Recordings

http://www.ee.usyd.edu.au/people/philip.leong/hit-rc14.html Slides often change just prior to lecture!

Video recording of lecture

Introduction

› Most practical digital designs built from datapath+control -  control implemented as a finite state machine

› Describe how to implement finite state machines -  how to construct a FSM

-  how to optimize for time and area

3

Simple Example: A memory controller

4

State Transition Table

5

Current State Input Condition

Next State Output

Idle ready Decision OE=0,WE=0 Decision ~rw

rw Write Read

OE=0,WE=0

Write ready Idle OE=0,WE=1 Read ready Idle OE=1,WE=0

•  Note that output is only a function of the state (Moore machine) •  If input condition not met, state does not change

Next State Logic Output LogicState Registersinputs next_state current_state outputs

State Assignment

6

Current State Input Condition

Next State Output

Idle ready Decision OE=0,WE=0 Decision ~rw

rw Write Read

OE=0,WE=0

Write ready Idle OE=0,WE=1 Read ready Idle OE=1,WE=0

(s1,s0) (n1,n0) Output 00 01 OE=0,WE=0 01 10

11 OE=0,WE=0

10 00 OE=0,WE=1 11 00 OE=1,WE=0

•  n1 = ~s1.~s0.ready •  n0 = ~s0.~s0.ready+ ~s1.s0.rw •  OE = s1.s0 •  WE = s1.~s0

Implementation

7

•  n1 = ~s1.~s0.ready •  n0 = ~s0.~s0.ready + ~s1.s0.rw •  OE = s1.s0 •  WE = s1.~s0

Speed

8


› Clock period constraint due to setup time -  tclk ≥ tlogic + tsu+ tcq

-  where tlogic is the next state logic delay

-  tsu is the register setup time (time before the clock edge that data must be stable)

-  tcq is the clock to output delay (time after clock edge q might be unstable)

› Clock period constraint due to hold time -  tH ≤ tlogic + tcq ≤ tclk

-  This is most extreme in the case of a shift register

tcq tco tSU

Optimization for speed/area

› Often there is no need -  FPGA usually fast enough so almost any description is acceptable

-  in this case use the most straightforward implementation (easiest to write/read/debug)

-  otherwise giving timing constraints often does the job

› Optimization -  FSM dependent

-  most efficient FSM depends on # states, complexity of logic etc

-  architecture dependent -  different for FPGA/CPLD/ASIC

-  speed max freq?, smallest clock to output delay? -  interrelated but we can optimize one in favor of another

9

Other FSM architectures

› We will look at 4 different implementations -  outputs decoded from state bits combinatorially

-  output decoded in parallel output registers

-  outputs encoded within state bits

-  one hot encoding

10

Outputs decoded from state bits combinatorially

›  This is the technique described earlier

› Output logic is a combinatorial function of the state -  suffers from an additional delay from output of state bits to the output signals


11

Glitching

› Consider a state change from 1111 -> 0000 -  The 4 bits could change at different speeds e.g. 1111 -> 1011 -> 0011 -> 0000

-  Output logic could produce a number of spurious values during current_state transient

-  Wastes power and could affect correct operation

12


Output decoded in parallel output registers

› Decode the outputs before the state bits are registered i.e. at the output of the nextstate

› Outputs are generated earlier than the previous version this is because the output combinatorial logic delay is removed

›  Area larger because we have additional registers for the output ›  Speed we have two combinatorial delays to the output so maximum speed

may be affected ›  Avoids glitching of outputs

Next State Logic State Registersinputs next_state current_state

outputsOutput Logic Output Registers

13

Outputs encoded within state bits

›  An example is a counter - the output is also the state

›  Find a state encoding which directly generates the output -  design more difficult

-  we need to find such an encoding

-  code more complicated and difficult to read

-  can resolve clashes by adding outputs to differentiate similar bits

-  e.g. for the memory controller can add ST0 to differentiate the IDLE and DECISION states

›  Efficiency depends on the particular circuit -  may be more/less efficient in area/speed, best way is to try it for a critical design

Next State Logic State Registersinputs next_state

current_state

outputs

14

Gray code

› Gray code is arranged so that only one output signal changes (no glitching)

› Can code FSMs so that the state transitions follow Gray codes

› Conversion simple g[i] = b[i] xor b[i+1] (b[4] = 0)

15

One hot encoding

› Use n flip-flops to represent an n-state FSM › Significantly reduces the logic for outputs and nextstate

-  why?

-  also reduces logic to generate outputs

-  at the expense of more registers

› FPGAs -  rich in registers

-  for max speed and small to medium FSMs often the best choice

16

Mealy machines

› All FSMs we have discussed are Moore machines since outputs are functions only of the current state -  for Mealy FSMs, outputs functions of current state and inputs

-  output can usually change 1 cycle earlier than Moore

-  can have fewer states than Moore

Next State Logic Output LogicState Registersinputs current_state outputsnext_state

17

Fault tolerance

› What happens if for whatever reason the FSM gets into an undefined state -  fault tolerant designs can recover

-  alternatively could enter an error state that somehow reports

-  explicit “don’t care”

18

Fault tolerance for one-hot designs

›  There are many more possible illegal states for 1-hot designs -  could detect with logic

-  this is quite expensive, affects performance and area

-  could be pipelined

19

BlockRAM Applications, State Machine

Store next address in the ROM

Conditional jump info is being entered as

additional address inputs

One RAM can be split in two: e.g. One 18K dual-port = two 9K BlockRAMs

with independent everything:

( R/W, clock, address, data content, aspect ratio )

High speed, independent of complexity

Slide courtesy of Peter Alfke 20

Fast State Machine in one BlockRAM

256 states, 4-way branch (or 128 states, 8-way) 36 optional parallel outputs from the second port

High speed operation, independent of code

8 bits 8 + 1 bitsBranchControl 2 bits

Output

BlockROM

1K x 9

256 x 368 bits 36 bits

Slide courtesy of Peter Alfke 21

Metastability

› Recall all registers have setup (tSU) and hold (tH) times. Violating these can lead to metastability in which the specified clock-to-output delay (tCO) is violated.

22 Source: Altera

Mechanism

23

›  The mechanism is analagous to a ball dropped on a hill. If setup and hold times are met, the ball is dropped near a stable state.

›  Due to noise, it is eventually resolved but this takes an indeterminate amount of time.

Source: Altera

Modeling Metastability

24

›  The mean time between failure can be modeled as

-  where C1 and C2 are process constants

-  fCLK, fDATA are clock frequency of receiving signal and toggle rate of asynchronous input

-  tMET is timing slack available beyond the register’s tCO, for a potentially metastable signal to resolve to a known value

-  tMET/C2 has largest effect and increases in transistor speed improve metastability MTBF

Synchronization Chains

25

›  Metastability can be virtually eliminated (with an increase in latency) via synchronization chains. tMET is sum of the output timing slacks for each register in the chain -  e.g. if tMET is 50 ps, increasing to 1 ns improves MTBF by e20=5×108 times

›  Signals on different clock domains can be handled with a dual-clock FIFO which contains synchronizers

›  Never connect asynchronous input to multiple flip-flops as they can resolve at different speeds

Clock skew

›  The clock signal can arrive at different times, this is called clock skew

› Clock subject to variation, this is called clock jitter

› While quite well controlled, skew and jitter reduce maximum clock frequency -  tclk ≥ tlogic + tsu+ tcq + tskew + tjitter

26

Conclusions

›  Studied many different ways to implement FSMs

› Mostly we use -  outputs decoded combinatorially from state registers

-  one-hot (particularly good for high speed small-medium FSMs on FPGAs)

-  RAM/ROM

› Consider metastability when dealing with asynchronous inputs

27

References

›  An excellent tutorial on HDL, particularly the “Technology Independent Coding Style” chapter http://www.actel.com/documents/hdlcode_ug.pdf

›  Altera White Paper, Understanding Metastability in FPGAs http://www.altera.com/literature/wp/wp-01082-quartus-ii-metastability.pdf

28

Reconfigurable Computing Shift Registers and Random Number Generators

School of Electrical and Information Engineering

http://www.ee.usyd.edu.au/~phwl

Philip Leong ([email protected])

Introduction

› Efficient FPGA design is often nonobvious -  with practice, you can learn when to use the special features

› This lecture -  Shift registers -  Linear feedback shift register (LFSR) -  Using RAM for shift registers -  True random number generator

Shift register

›  Basic building block in digital circuit design

›  Features -  high speed due to simple logic & interconnection

-  low interconnection costs

Shift registers

›  Applications -  delay line

-  serial-parallel conversion

-  parallel-serial conversion

-  boundary scan

-  serial buses

-  bit serial signal processing (more about this in future lectures)

Straightforward Implementation

module shift_1x64 (clk, shift, sr_in, sr_out,); input clk, shift; input sr_in; output sr_out; reg [63:0] sr; always@(posedge clk) begin if (shift == 1'b1) begin sr[63:1] <= sr[62:0]; sr[0] <= sr_in; end end assign sr_out = sr[63]; endmodule How many LEs for an N bit SR?

Straightforward Implementation

VHDL

if (clk’event and clk = ‘1’) then

sr <= srin & sr(sr’high downto (sr’low+1));

end if;

How many LEs for an N bit SR?

VERILOG module shift_1x64 (clk, shift, sr_in, sr_out,); input clk, shift; input sr_in; output sr_out; reg [63:0] sr; always@(posedge clk) begin if (shift == 1'b1) begin sr[63:1] <= sr[62:0]; sr[0] <= sr_in; end end assign sr_out = sr[63]; endmodule

RAM

› Remember -  Xilinx LUT as a memory has better density than a LUT

-  Shift registers only need serial access to the memories

-  i.e. do not need all of the outputs in parallel, only need the first input and the last output

Shift register using RAM (4-LUT example)

Address generation: counters

› Ripple carry counter

›  Synchronous counter

› Gray code counter

›  LFSR

› What are the advantages and disadvantages of each?

Counters

Gray code

LFSR

›  Linear feedback shift register -  XNOR of certain outputs are fed back to the input

-  it is maximum-length if it has a period of 2^N-1

-  lets look at a length 4 maximum length LFSR

4 bit Maximal Length LFSR

4 bit LFSR

› How does it count?

4 bit LFSR

›  0, 1, 3, 7, E, D, B, 6, C, 9, 2, 5, A, 4, 8, 0

›  period is 2^4-1 = 15

›  can be used as pseudorandom number generators

›  note it only has a 2 input XOR gate as the critical path! -  Compare with an adder’s carry chain

Larger maximal length LFSRs (see XAPP052 for more)

› N=4 (4,3)

› N=8 (8,6,5,4)

› N=16 (16,15,13,4)

› N=32 (32,22,2,1)

› N=64 (64,63,61,60)

› N=128 (128,126,101,99)

› N=168 (168,166,153,151)

Primitive polynomials

› A polynomial P of degree m over a finite field Fp is primitive, iff P is monic, P(0)≠0 and ord(P(x))=pm-1. -  Monic means the coefficient of the term with the highest power is 1 -  If P(0)≠0, ord(P) is smallest integer e for which P(x) divides xe-1 -  E.g. P(x)=x4+x3+1 is primitive

› The feedback values for an XOR based maximum length LFSR correspond to the primitive polynomial

› Programs which compute primitive polynomials -  Web based but only supports small n

http://wims.unice.fr/wims/wims.cgi?session=VX756ADCFA.2&+lang=en&+module=tool%2Falgebra%2Fprimpoly.en

LFSR counters

› Maximal LFSR period is 2^N-1 -  sometimes we want the period < 2^N-1

-  Arbitrary period counter with shorter period can be made by adding a synchronous reset circuit

Virtex SRL

›  Virtex has a macro called “Shift Register Lookup Table” (SRL) for implementing SRs

LUT as a mux

LUT as an addressable SR

•  Length of SR can be controlled using the D output

Virtex 5

› Virtex 5-7 series slice has 4x 6-input LUTs, each LUT can be a 64x1 or 32x2 RAM

› SRL16x2 and SRL32 primitives are available per LUT (in SLICEM)

SRLC32E

True Random Number Generator

›  Introduction

› Ring Oscillator based noise source

›  Alternating Sequence Generator

› Results

› Conclusion

Introduction

› Random number generators (RNGs) are in widespread use, main applications are in -  Simulation

-  Cryptography

›  Propose random number generators for these applications -  Physical RNG (PRNG) based on oscillator phase noise

What is a random number generator?

›  A cryptographically secure random bit generator (CSRBG) is one which produces sequences for which there is no polynomial time algorithm which, on input of the first l bits of the output sequences, can predict the (l+1)st bit of s with a probability significantly greater than 0.5

Applications

› Designed for embedded FPGA applications -  Small

-  Fast

›  For cryptographic applications -  Secure

Physical random number generators

› Oscillator sampling

› Direct amplification of transistor/resistor noise

› Discrete time chaos

Oscillator phase noise source

•  One method to produce a real random number source in an FPGA is to sample a high frequency signal with a low quality low frequency clock

•  If the phase noise of the low frequency oscillator is of the same order as the period of the high frequency clock, output is quite random

Ring Oscillator based noise source

Increased Randomness via Clock Doubling

Poker Test (ONS only)

›  Checks that each m-bit string occurs the correct number of times ›  Passed if the result is between 1.03 and 57.4

From Menezes, “Handbook of Applied Cryptography”

Poker Test (ONS only)

PRNG

›  At this point, have a mediocre physical RNG that doesn’t pass Poker test › Can turn into a good RNG by reducing sampling clock rate and passing through

a parity filter to deskew non-uniform distribution (Tsoi, Leung and Leong, FCCM03) -  slow

› How can we make a fast one?

Alternating Step Generator

•  Selected among many stream ciphers because of its compact implementation

•  Recent research has shown weaknesses in this cipher (but any other can be used)

Modified Alternating Step Generator

LFSRs

› 127 and 129 bit LFSRs based on primitive polynomials with approx equal 1 and 0 coefficients -  LFSR1(x) = x127 + x102 + x61 + x59 + x58 + x57 + x56 + x55 + x54 + x53 +

x52 + x51 + x50 + x49 +x48 + x47 + x46 + x45 + x44 + x43 + x42 + x41 + x40 + x39 + x38 + x37 + x36 + x35 +x34 + x33 + x32 + x31 + x30 + x29 + x28 + x27 + x26 + x25 + x24 + x23 + x22 + x21 +x20 + x19 + x18 + x17 + x16 + x15 + x14 + x13 + x12 + x11 + x10 + x9 + x8 + x7 +x6 + x5 + x4 + x3 + x2 + x + 1

-  LFSR2(x) = x129 + x119 + x62 + x61 + x60 + x59 + x58 + x57 + x56 + x55 + x54 + x53 + x52 + x51 +x50 + x49 + x48 + x47 + x46 + x45 + x44 + x43 + x42 + x41 + x40 + x39 + x38 + x37 +x36 + x35 + x34 + x33 + x32 + x31 + x30 + x29 + x28 + x27 + x26 + x25 + x24 + x23 +x22 + x21 + x20 + x19 + x18 + x17 + x16 + x15 + x14 + x13 + x12 + x11 + x10 + x9+x8 + x7 + x6 + x5 + x4 + x3 + x2 + x + 1

Implementation

› Results for Virtex XCV300E-8 (including computer interface) - Period 7.482ns, 129 slices (4%), 4 BRAM (12%) - High frequency clock can be > 400 MHz (133 MHz used) - Ring oscillator freq > 800 MHz

› Reported results are for no clock doubler but doubled version passes tests as well

› Passes Crush, NIST and Diehard tests

NIST and Diehard Tests

TRNG Summary

› Combines weak RNG with high speed stream cipher to produce a physical noise source

› Clock doubler increases oscillator phase noise › Small area, high output rate, good statistical properties › ASG makes a very low cost implementation, other stream ciphers can be used

Generalized LFSRs

›  A variation useful for cryptographic applications -  Feedback taps are programmable (can be updated after synthesis)

-  Polynomial has high weight so it has many feedback taps requiring high fan-in XNOR gate

› Question: how do we design such a circuit?

Block diagram

Q

QSET

CLR

D

Q

QSET

CLR

D

Q

QSET

CLR

D

Q

QSET

CLR

D

Q

QSET

CLR

D

...

PROGRAMMABLE LUT implementing required XNOR

CLK

CE

DIN

Q1 Q3Q2 Q15

F(Q1,Q2,…,Q15)

Implementing the LUT

LUT 1 LUT 2 LUT 3 LUT 4

LUT 1

Q1 Q4Q3Q2 Q13 Q16Q15Q14Q9 Q12Q11Q10Q5 Q8Q7Q6

F(Q1,Q2,…,Q16)

•  SR used to program the LUTs to implement the required feedback arrangement WITHOUT CHANGING THE BITSTREAM

•  Can be reduced from 2 levels of logic to 1 by pipelining

Generalized LFSR Floorplan

•  Serpentine arrangement reduces wire lengths

LFSR Cell LFSR CellLFSR CellLFSR Cell




Stage 1 pipeline

LFSR cell

Q

QSET

CLR

DSR input SR output

Q

QSET

CLR

D

From SouthConnections

x3

u1x2

x1

f(x1...xn)

To left handtop cell

Leftmost cell hasone input from

each of theflip-flops on the

first row

x3

x4

u1 x2

x1

f(x1...xn)Row 1 Leftmost cell only

Row 1 Non-leftmost cell only

All cells•  RAMs used to customize LFSR polynomial (can be loaded at any time)

•  Leftmost stage has no pipelining

Pipelining Logic

Q

QSET

CLR

D

Q

QSET

CLR

D

Q

QSET

CLR

D

Q

QSET

CLR

D

•  Must compensate for the extra latency introduced by pipelining

Conclusions

›  If we can use Xilinx distributed RAM it gives higher better density -  can be used for shift registers, counters & PRNG

-  Virtex has a SRL macro for even better efficiency

›  LFSR -  generate random numbers -  high speed nonbinary counters and dividers -  Can construct rrue random number generator using oscillator phase noise and the ASG

References

›  XAPP052 “Efficient Shift Registers, LFSR Counters and Long PseudoRandom Sequence Generators” and XAPP210 “LFSRs in Virtex Devices” (http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf)

›  K.H. Tsoi, Ka Ho Leung, and Philip H.W. Leong. A high performance physical random number generator. IEE Proc. Computers & Digital Techniques, 1(4):349–352, July 2007. (http://www.cse.cuhk.edu.hk/~phwl/mt/public/archives/papers/prng_cdt07.pdf)

Review Questions

› Make sure you understand how the SRL is implemented in an FPGA

› Make sure you can implement an LFSR in VHDL/Verilog

› When is an LFSR more efficient than a counter for counting applications?

reconfigurable computing - university of sydney computing finite state machines school of electrical...

Documents