
MSE: Hardware Algorithms

Parallelization

Marcel Jacomet, Josef Goette

Bern University of Applied Sciences, BFH-TI HuCE-microLab, Biel/Bienne

October 11, 2017

Contents

1 Introduction

2 Parallelization

3 Unfolding

4 Hardware Rules

5 OCT Example
5.1 OCT Introduction

6 Parallelization at OCT Example
6.1 Data-Path Unfolding
6.2 FiFo Unfolding
6.3 DFT Unfolding

References

Hardware Algorithms

© Marcel Jacomet, 2012 - 2016

All rights reserved. This work may not be translated or copied in whole or in part without the written permission by the author, except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software is forbidden.

1 Introduction

Textbooks

• VLSI Digital Signal Processing Systems: Design and Implementation, Keshab K. Parhi, John Wiley & Sons, ISBN 0-471-24186-5, 1999, USD 135

• OCT texts discussing the lab example can be found on the web

2 Parallelization


Parallelization Principles 1

• parallelization of degree p speeds up a hardware algorithm by up to a factor of p

• hardware can basically be parallelized in two ways:

– p identical hardware paths processing time-delayed data streams in parallel

– p interlinked hardware paths processing, in parallel, a stream of data vectors that each hold p data sets

• the first approach is a straightforward implementation that uses p times the hardware of the non-parallelized design

• the second approach is more challenging: it uses p times the operators of the non-parallelized hardware, but only the same number of storage elements


Parallelization Principles: Parallel Streams

Parallelization Principles: Parallel Sets

[Figure: data sampling channel 1; each data sample is a 5-set vector; interlinked parallel processing of samples (vectors)]


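Dataflow Graph Representation

y[n] = a · x[n] + b · x[n−1] + c · x[n−2]

• block diagram of 3-tap FIR filter

[Figure: block diagram with two 1/z unit delays producing x[n−1] and x[n−2]; the taps x[n], x[n−1], x[n−2] are weighted by a, b, c and summed to y[n]]

• data-flow diagram of 3-tap FIR filter

[Figure: data-flow graph from x[n] to y[n] with multiplier nodes a, b, c and edge delays D and 2D]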
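A minimal Python sketch of the 3-tap FIR filter above, with the two unit delays modelled as registers that are updated once per sample. The function name `fir3` and the coefficient values in the example are illustrative assumptions, not part of the lecture.

```python
def fir3(x, a, b, c):
    """Sample-by-sample model of y[n] = a*x[n] + b*x[n-1] + c*x[n-2]."""
    d1 = 0.0  # register holding x[n-1] (first 1/z element)
    d2 = 0.0  # register holding x[n-2] (second 1/z element)
    y = []
    for xn in x:
        y.append(a * xn + b * d1 + c * d2)
        # clock edge: shift the delay line
        d2, d1 = d1, xn
    return y


# the impulse response reproduces the coefficients (assumed example values)
impulse_response = fir3([1, 0, 0, 0], a=2, b=3, c=5)
print(impulse_response)
```

Feeding a unit impulse is a quick sanity check: the output lists the coefficients a, b, c in order, followed by zeros.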


Dataflow Graph: Pipelining

• pipelining is done by introducing additional delay elements (registers)

• pipelining delay elements can only be placed in feed-forward paths

[Figure: data-flow graphs of the 3-tap FIR filter without pipelining (edge delays D, 2D) and with pipelining (edge delays D, 3D, D)]


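Dataflow Graph: Pipelining for Speedup

• pipelining to increase clock frequency

• retiming theory (Bellman-Ford or Floyd-Warshall algorithms)

• FIR example: with 2u per multiplier and 1u per adder, the unpipelined critical path is one multiplier plus two adders (4u); after pipelining it is 2u, so the clock frequency is 1/(2u) instead of 1/(4u)

[Figure: three data-flow graphs of the 3-tap FIR filter, with multipliers a, b, c annotated (2u) and adders annotated (1u): the original graph and two versions with additional delay elements D]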
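The critical path of a data-flow graph is the longest chain of operators that is not interrupted by a register. The sketch below computes it for the FIR example. The node names, the edge list and the register placement of the pipelined version are assumptions for illustration (one feed-forward cutset that yields the stated 2u); the figure may place the registers differently. The sketch also assumes that the zero-delay edges form no cycle.

```python
def critical_path(node_time, edges):
    """Longest combinational path: only edges with zero delays are followed."""
    succ = {u: [] for u in node_time}
    for u, v, delays in edges:
        if delays == 0:
            succ[u].append(v)

    memo = {}

    def longest_from(u):
        if u not in memo:
            tail = max((longest_from(v) for v in succ[u]), default=0)
            memo[u] = node_time[u] + tail
        return memo[u]

    return max(longest_from(u) for u in node_time)


# multipliers take 2u, adders take 1u
node_time = {"mul_a": 2, "mul_b": 2, "mul_c": 2, "add1": 1, "add2": 1}

# original FIR: multiplier outputs feed the adder chain without registers
original = [("mul_a", "add1", 0), ("mul_b", "add1", 0),
            ("add1", "add2", 0), ("mul_c", "add2", 0)]

# pipelined FIR (assumed placement): one register on every multiplier output
pipelined = [("mul_a", "add1", 1), ("mul_b", "add1", 1),
             ("add1", "add2", 0), ("mul_c", "add2", 1)]

print(critical_path(node_time, original))   # 4 -> clock frequency 1/(4u)
print(critical_path(node_time, pipelined))  # 2 -> clock frequency 1/(2u)
```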


3 Unfolding

Unfolding 1

• unfolding or loop unrolling

• example

y[n] = a · y[n−9] + x[n]

for i ← 1 to ∞ do
    y[i] ← a · y[i−9] + x[i]

• replace index n by 2k for the even samples and by 2k+1 for the odd samples

• together, the 2 equations describe the same algorithm

y[2k] = a · y[2k−9] + x[2k]

y[2k+1] = a · y[2k−8] + x[2k+1]


Unfolding 2

• parallelization degree J: J-slow

• J-slow means that a delay element fed with x[kJ+m] outputs x[(k−1)J+m], i.e. one delay in the unfolded graph corresponds to J delays in the original sample stream

• thus we get:

y[2k] = a · y[2(k−5)+1] + x[2k]

y[2k+1] = a · y[2(k−4)+0] + x[2k+1]
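A short Python check that the two unfolded equations compute the same sequence as the original recurrence. Zero initial conditions, the coefficient a = 0.5 and the test input are assumptions chosen for illustration.

```python
def original(x, a):
    """y[n] = a*y[n-9] + x[n], with zero initial conditions."""
    y = []
    for n, xn in enumerate(x):
        y.append(a * (y[n - 9] if n >= 9 else 0.0) + xn)
    return y


def unfolded_2(x, a):
    """Two samples per iteration k: y[2k] and y[2k+1] are computed together."""
    y = [0.0] * len(x)

    def past(idx):
        # samples before n = 0 are taken as zero
        return y[idx] if idx >= 0 else 0.0

    for k in range(len(x) // 2):
        y[2 * k] = a * past(2 * (k - 5) + 1) + x[2 * k]
        y[2 * k + 1] = a * past(2 * (k - 4) + 0) + x[2 * k + 1]
    return y


x = [float(n % 7) for n in range(40)]  # arbitrary test input of even length
print(original(x, 0.5) == unfolded_2(x, 0.5))
```

Within one iteration k the two equations do not depend on each other, which is what allows both outputs to be computed in parallel.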


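Unfolding 3

• data flow graph of example

[Figure: original data-flow graph with input x[n], output y[n], multiplier a and a feedback edge with 9D]

• algorithm of example (2-slow)

[Figure: 2-unfolded data-flow graph with inputs x[2k], x[2k+1] and outputs y[2k], y[2k+1]; the two multipliers a sit in cross-coupled feedback edges with 5D (from y[2k+1] to y[2k]) and 4D (from y[2k] to y[2k+1])]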


Unfolding Design Procedure

• for each node U in the original DFG, draw the J nodes $U_0, U_1, \ldots, U_{J-1}$

• for each edge U → V with w delays in the original DFG, draw the J edges $U_i \to V_{(i+w) \bmod J}$ with $\lfloor (i+w)/J \rfloor$ delays, for $i = 0, 1, 2, \ldots, J-1$

[Figure: example DFG with nodes U, V, T and edge delays D, 6D, 5D, next to its 3-unfolded DFG with nodes U0 to U2, V0 to V2, T0 to T2 and edge delays D and 2D]
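The two rules of the design procedure translate almost literally into code. The following is a sketch; the edge-list representation `(U, V, w)` and the function name `unfold` are assumptions for illustration.

```python
def unfold(edges, J):
    """Unfold a DFG given as (U, V, w) edges, w = number of delays on U -> V."""
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            src = f"{u}{i}"
            dst = f"{v}{(i + w) % J}"
            unfolded.append((src, dst, (i + w) // J))
    return unfolded


# feedback edge of y[n] = a*y[n-9] + x[n]: node Y back to itself with 9 delays
for edge in unfold([("Y", "Y", 9)], J=2):
    print(edge)
```

For the 9-delay feedback loop and J = 2 this yields the edges Y0 → Y1 with 4 delays and Y1 → Y0 with 5 delays, matching the 4D and 5D of the 2-slow example. The total number of delays (9) is preserved, which is why unfolding does not increase the number of storage elements.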


4 Hardware Rules

Signal Processing Hardware Rules: "No Control Path"

• a 1/z register stores a new input sample at every clock cycle

• an if clause asks for controllable registers (with enable)

• let's build it in Simulink: hardware rule

[Figure: Simulink models for the hardware rule. A Unit Delay (1/z) corresponds to a register (D, clk, Q). An Enabled Unit Delay (inputs u and E, output y) corresponds to an enabled register (D, clk, Q, ena), which is built from a Unit Delay and a Switch (~=0) controlled by ena]
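The hardware rule can be sketched in Python as a plain unit delay whose input is selected by a switch: new data when the enable is non-zero, the fed-back old value otherwise. The class name `EnabledRegister` and the sample values are assumptions for illustration, not the Simulink model itself.

```python
class EnabledRegister:
    """Enabled register built from a plain unit delay plus a switch (mux)."""

    def __init__(self, init=0):
        self.q = init  # content of the 1/z element

    def clock(self, d, ena):
        """One clock cycle: return the current output, then update the state."""
        out = self.q
        # switch (~=0): take the new sample if ena != 0, else feed back the old value
        self.q = d if ena != 0 else self.q
        return out


reg = EnabledRegister()
samples = [10, 20, 30, 40]
enables = [1, 0, 0, 1]
outputs = [reg.clock(d, e) for d, e in zip(samples, enables)]
print(outputs, reg.q)
```

The register itself is clocked unconditionally in every cycle; the conditional behaviour lives entirely in the data path, which is the point of the "no control path" rule.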


5 OCT Example

5.1 Introduction to OCT

Optical coherence tomography (OCT) is an optical signal acquisition and processing method. It captures micrometer-resolution, three-dimensional images from within optical scattering media (e.g., biological tissue). Optical coherence tomography is an interferometric technique, typically employing near-infrared light. The use of relatively long-wavelength light allows it to penetrate into the scattering medium. Reflection is caused by refractive index changes at tissue boundaries, and scattering is a diffraction process at micro-structures in the tissue. OCT signals only contain information about the depth of scattering or reflecting structures and cannot differentiate between these two fundamental processes. A relatively recent implementation of optical coherence tomography, frequency-domain optical coherence tomography, provides advantages in signal-to-noise ratio, permitting faster signal acquisition. Optical coherence tomography systems are employed in diverse applications, including art conservation and diagnostic medicine, notably ophthalmology, where they can be used to obtain detailed images from within the retina. Advantages compared to other techniques are the achieved tissue penetration (1 to 3 mm) combined with the relatively high axial resolution (0.5 to 15 µm) at a very high measuring frequency (several 100 kS/s).

Introduction to OCT: Features

• OCT is an optical signal acquisition and processing method

• micrometer resolution in 3-D images

• optical scattering/reflecting media: biological tissues

• interferometric technique with near-infrared laser

• reflection is caused by refractive index changes at tissue boundaries

• recent OCT technology is frequency-domain OCT, which provides improved SNR and high-speed signal acquisition


Introduction to OCT: Applications

• applications in medicine: ophthalmology, ...

• depth penetration of 1 to 3 mm (A-scan)

• speeds of 100 k depth scans per second at 2048 pixels per scan, i.e. ≥ 200 MS/s

• OCT image of pig eye at HuCE-optoLab (left), OCT setup with Gecko platform at HuCE-microLab (right)


The optical setup for frequency-domain OCT typically consists of an interferometer with a low-coherence, broad-bandwidth light source (white light) or a narrow-band sweeping light source. Light is split into, and recombined from, the reference and sample arms.

Introduction to OCT: Principle

• low coherence source (LCS)

• beam splitter (BS)

• reference (REF) and sample arm (SMP)

• diffraction grating (DG) and full-field camera (CAM) as spectrometer (source: wiki)


The measured input samples received by the digital signal processing units are equidistant in wavelength (the x-axis is the wavelength, the y-axis is the measured OCT light intensity). A first step in the OCT processing is to remap the measured light intensity so that it is equidistant in wave number instead of wavelength. This pre-processing step is needed for the succeeding DFT. Use simple linear interpolation to calculate the remapped sample intensity.

Introduction to OCT: Signals

• top: captured Fourier-domain OCT signals of A-scan

• middle: signals after filtering and remapping

• bottom: final A-scan image after inverse FFT

[Figure: three plots of intensity (a.u.): top versus wave length [nm], middle versus wave number [1/um], bottom versus depth z [um]]


Signal Processing in OCT: Remapping 1

• OCT input signals are captured in the λ (wavelength) domain

• they have to be transformed into the k (wave number) domain

• this process is called remapping

[Figure: comparison of k (linear) and k = 2*pi/lambda(n); both axes range from 7.25 to 7.75]


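• λ (wavelength) from 810 nm to 870 nm

• λ sampled equidistantly in wavelength: $L_n$

• λ sampled equidistantly in wave number: $L_m$

[Figure: input signal on the grid $L_{n-1}, L_n, L_{n+1}, L_{n+2}$ (equidistant in λ, spacing $L_{step}$) and remapped signal on the grid $L_{m-1}, L_m, L_{m+1}$ (equidistant in k); the output out(m) lies between the input values valA and valB]

• relation is $k = 2\pi/\lambda$, with

$$L_{step} = \frac{\lambda_{max} - \lambda_{min}}{N}, \qquad L_n = \lambda_{min} + n \cdot L_{step}$$

$$k_{step} = \frac{\frac{2\pi}{\lambda_{min}} - \frac{2\pi}{\lambda_{max}}}{N}, \qquad L_m = \frac{2\pi}{k_{max} - m \cdot k_{step}}$$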
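The two sampling grids can be computed directly from these relations. This is a minimal sketch: the function name `remap_grids` and the toy size N = 8 are assumptions, and $k_{max} = 2\pi/\lambda_{min}$ is taken from $k = 2\pi/\lambda$.

```python
import math


def remap_grids(lam_min, lam_max, N):
    """Return (L_n, L_m): wavelengths equidistant in lambda and in k = 2*pi/lambda."""
    L_step = (lam_max - lam_min) / N
    k_max = 2 * math.pi / lam_min  # largest wave number belongs to lam_min
    k_step = (2 * math.pi / lam_min - 2 * math.pi / lam_max) / N
    L_n = [lam_min + n * L_step for n in range(N + 1)]
    L_m = [2 * math.pi / (k_max - m * k_step) for m in range(N + 1)]
    return L_n, L_m


L_n, L_m = remap_grids(810.0, 870.0, N=8)  # N = 8 is an assumed toy size
for ln, lm in zip(L_n, L_m):
    print(f"{ln:8.3f} {lm:8.3f}")
```

Both grids start at 810 nm and end at 870 nm, but the $L_m$ points are denser at short wavelengths and sparser at long ones.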


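• signal processing with look-up table

– no iterative division

– no accumulated error due to continuous summing

[Figure: input signal on the grid $L_{n-1}, L_n, L_{n+1}, L_{n+2}$ with spacing $L_{step}$ and remapped signal on the grid $L_{m-1}, L_m, L_{m+1}$; the output out(m) lies between the input values valA and valB]

$$out_m = valA + \frac{valB - valA}{L_{step}} \cdot (L_m - L_n)$$

$$out_m = valA + (valB - valA) \cdot LUT_k(addr)$$

The look-up table thus holds the precomputed fraction $(L_m - L_n)/L_{step}$.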
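A sketch of the look-up-table interpolation in Python. The fractions are computed once, offline; at run time each output sample costs one subtraction, one multiplication and one addition. The second table `lut_n`, which stores the input index per output sample, and the toy size N = 16 are assumptions of this sketch; the slides handle the input index through the control path instead.

```python
import math

LAM_MIN, LAM_MAX, N = 810.0, 870.0, 16  # N = 16 is an assumed toy size
L_STEP = (LAM_MAX - LAM_MIN) / N
K_MAX = 2 * math.pi / LAM_MIN
K_STEP = (2 * math.pi / LAM_MIN - 2 * math.pi / LAM_MAX) / N

# offline: for every output sample m store the input index n and the
# precomputed fraction (L_m - L_n) / L_step
lut_n, lut_k = [], []
for m in range(N):
    L_m = 2 * math.pi / (K_MAX - m * K_STEP)
    n = min(int((L_m - LAM_MIN) / L_STEP), N - 1)
    lut_n.append(n)
    lut_k.append((L_m - (LAM_MIN + n * L_STEP)) / L_STEP)


def remap(inp):
    """Run time: out_m = valA + (valB - valA) * LUT_k; inp holds N + 1 samples."""
    return [inp[n] + (inp[n + 1] - inp[n]) * k for n, k in zip(lut_n, lut_k)]


# a signal that is linear in wavelength is reproduced exactly by the interpolation
inp = [2.0 * (LAM_MIN + n * L_STEP) + 1.0 for n in range(N + 1)]
print(remap(inp)[:4])
```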


Signal Processing in OCT: Control Path

• signal processing: data path and control path

– a for clause would be perfect

– an if clause in the code asks for a control path

– control can also be done by look-up tables

[Figure: input signal (equidistant sampling in wave length) on the grid $L_{n-1}, L_n, L_{n+1}, L_{n+2}$ and remapped signal (equidistant sampling in wave number) on the grid $L_{m-1}, L_m, L_{m+1}, L_{m+2}$, with values valA, valB and out(m); the input intervals are labelled 1x, 2x, 0x]


Signal Processing in OCT: Datapath and Control Path

i ← 1, j ← 1, m ← 1, adr ← 1
while m ≤ 1024 do
    varA ← inp[i]
    varB ← inp[i + 1]
    if lutCtr(adr − 1) ≠ 2 then
        outm(j) ← varA + (varB − varA) ∗ lutK(adr)
    if lutCtr(adr) = 0 then          // increment input and output sample index
        m ← m + 1
        i ← i + 1
    else if lutCtr(adr) = 3 then     // keep, do not load new input sample
        m ← m + 1
    else if lutCtr(adr) = 2 then     // skip, do not generate output sample
        i ← i + 1
    adr ← adr + 1
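The idea of steering the data path with a control look-up table can be sketched in Python as follows. This is a simplified variant under stated assumptions, not the exact loop above: it reuses the control codes 0 (advance input and output), 3 (keep the input) and 2 (skip, no output), but the function names, the way the tables are generated and the 0-based indexing are choices made for this sketch.

```python
import math


def build_luts(lam_min, lam_max, N):
    """Offline: control codes and interpolation fractions for N output samples."""
    L_step = (lam_max - lam_min) / N
    k_max = 2 * math.pi / lam_min
    k_step = (k_max - 2 * math.pi / lam_max) / N
    lut_ctr, lut_k = [], []
    i = 0  # current input interval [L_i, L_i+1]
    m = 0  # next output sample
    while m < N:
        L_m = 2 * math.pi / (k_max - m * k_step)
        pos = (L_m - lam_min) / L_step  # position in units of L_step
        if pos >= i + 1:
            lut_ctr.append(2)  # skip: load next input, no output
            lut_k.append(0.0)
            i += 1
            continue
        lut_k.append(pos - i)
        L_next = 2 * math.pi / (k_max - (m + 1) * k_step)
        if (L_next - lam_min) / L_step < i + 1 and m + 1 < N:
            lut_ctr.append(3)  # keep: next output uses the same inputs
        else:
            lut_ctr.append(0)  # advance input and output
            i += 1
        m += 1
    return lut_ctr, lut_k


def remap_stream(inp, lut_ctr, lut_k):
    """Run time: the control LUT replaces all index comparisons."""
    out, i = [], 0
    for ctr, k in zip(lut_ctr, lut_k):
        if ctr != 2:
            out.append(inp[i] + (inp[i + 1] - inp[i]) * k)
        if ctr != 3:
            i += 1
    return out


lut_ctr, lut_k = build_luts(810.0, 870.0, N=16)  # N = 16 is an assumed toy size
print(lut_ctr)
```

All decisions that depend on the grids are made while the tables are built. The run-time loop only compares the control code against two constants, which maps onto enable signals for the input and output registers.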


Signal Processing in OCT: Simulink

Signal Processing in OCT: "No Control Path"

Signal Processing in OCT: Simplifications in Control Path



6 Parallelization at OCT Example

6.1 Data-Path Unfolding

Unfolding: OCT Example 1

• OCT data flow graph for interpolation

• exercise: design a 4-slow unfolding

• simulate it with Matlab/Simulink

[Figure: data-flow graph of the interpolation data path from in to out, with two Mux blocks (wr), +, − and * operators, delay elements D, and the look-up tables lutK and lutCTR]


Unfolding: How to Model the FiFo?

• OCT data flow graph for interpolation

• exercise: 4-slow unfolding including control path

• what about the FiFos?

[Figure: data-flow graph of the interpolation including the control path, with Mux blocks (wr), +, − and * operators, the blocks "not 3" and "not 2", the look-up tables LUT ctr and LUT k, delay elements D, 2D and 3D, and two FiFo blocks (push, pop) marked "??"]

6.2 FiFo Unfolding


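FiFo Model

• DFG model of a FiFo

• the FiFo has to be decomposed down to delay elements and combinational logic

[Figure: FiFo block (push, pop, in, out) and its DFG model: a dual-port RAM (in, out) with write address adrW and read address adrR; the addresses are held in delay elements D and updated through Mux blocks (wr) and +1 increments driven by push and pop]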
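A behavioural Python sketch of this decomposition: the FiFo becomes a dual-port RAM plus two address registers, each updated through a mux that selects either the incremented or the old address. The class name `FifoModel`, the wrap-around of the addresses and the absence of full/empty flags are assumptions of this sketch.

```python
class FifoModel:
    """FiFo decomposed into a dual-port RAM and two address registers."""

    def __init__(self, depth):
        self.depth = depth
        self.ram = [0] * depth  # dual-port RAM
        self.adr_w = 0          # write-address register (delay element)
        self.adr_r = 0          # read-address register (delay element)

    def clock(self, data_in, push, pop):
        """One clock cycle; no full/empty checking in this sketch."""
        out = self.ram[self.adr_r]          # read port
        if push:
            self.ram[self.adr_w] = data_in  # write port
        # address update: the mux selects adr + 1 or the old address
        self.adr_w = (self.adr_w + 1) % self.depth if push else self.adr_w
        self.adr_r = (self.adr_r + 1) % self.depth if pop else self.adr_r
        return out


fifo = FifoModel(depth=4)
for value in (1, 2, 3):
    fifo.clock(value, push=True, pop=False)
popped = [fifo.clock(0, push=False, pop=True) for _ in range(3)]
print(popped)
```

Only the two address registers and the RAM hold state; everything else is combinational, which is the form needed before the unfolding procedure can be applied.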


Unfolding the FiFo Model

• DFG model of a 2-slow unfolding of FiFos

• the unfolded model cannot be composed into FiFos again

• shall we start to re-implement all IP cores?

[Figure: 2-slow unfolded DFG of the FiFo model, with two copies of the dual-port RAM (in, out, adrW, adrR), Mux blocks (wr), push and pop inputs, +1 increments and delay elements D]

6.3 DFT Unfolding

DFT (DTFS): Discrete Fourier Transform

• natural parallelization by FFT algorithms

• N-point DFT

$$X[k] = \sum_{n=0}^{N-1} x[n] \, W_N^{kn}, \qquad k = 0, 1, 2, \ldots, N-1$$

where $W_N$ ≙ Nth root of unity

$$W_N = \sqrt[N]{1} = e^{-j(2\pi/N)}$$

• inverse transform

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, W_N^{-kn}, \qquad n = 0, 1, 2, \ldots, N-1$$

We need a note on the factor 1/N.
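The transform pair can be written down directly from the two sums. This is a plain $O(N^2)$ sketch in Python, not an FFT; the function names and the test vector are assumptions for illustration.

```python
import cmath


def dft(x):
    """X[k] = sum_n x[n] * W_N^(k*n), with W_N = exp(-j*2*pi/N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]


def idft(X):
    """x[n] = (1/N) * sum_k X[k] * W_N^(-k*n)."""
    N = len(X)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * W ** (-k * n) for k in range(N)) / N for n in range(N)]


x = [1.0, 2.0, 3.0, 4.0]
X = dft(x)
x_back = idft(X)
print([round(abs(v), 6) for v in x_back])
```

With this convention the factor 1/N sits in the inverse transform only, so that applying `idft` to the result of `dft` returns the original samples.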


DFT: Matrix Form

• denote the vector of input samples by

$$x = \left( x[0], x[1], x[2], \ldots, x[N-1] \right)^T$$

• denote the vector of spectral samples by

$$X = \left( X[0], X[1], X[2], \ldots, X[N-1] \right)^T$$

• then the DFT can be written as

$$X = DFT(x) = F x$$

with

$$F \mathrel{\widehat{=}} \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & W_N & W_N^{2} & \cdots & W_N^{N-1} \\ 1 & W_N^{2} & W_N^{2 \cdot 2} & \cdots & W_N^{2 \cdot (N-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & W_N^{N-1} & W_N^{(N-1) \cdot 2} & \cdots & W_N^{(N-1) \cdot (N-1)} \end{pmatrix}$$

Superscript T denotes transpose.
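The matrix form can be checked numerically with a small Python sketch that builds the Fourier matrix entry by entry and applies it as a matrix-vector product. The helper names `fourier_matrix` and `matvec` and the input vector are assumptions for illustration.

```python
import cmath


def fourier_matrix(N):
    """F[k][n] = W_N^(k*n) with W_N = exp(-j*2*pi/N)."""
    W = cmath.exp(-2j * cmath.pi / N)
    return [[W ** (k * n) for n in range(N)] for k in range(N)]


def matvec(F, x):
    """X = F x as a plain matrix-vector product."""
    return [sum(f * xn for f, xn in zip(row, x)) for row in F]


F4 = fourier_matrix(4)
X = matvec(F4, [1, 2, 3, 4])
for row in F4:
    print([complex(round(v.real), round(v.imag)) for v in row])
```

Each of the N rows produces one spectral sample from all N inputs, so the N output samples can in principle be computed by N parallel multiply-accumulate paths.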


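DFT: Low-Order Fourier Matrix Examples

• for N = 2: $W_N = W_2 = \sqrt[2]{1} = e^{-j2\pi/2} = e^{-j\pi} = -1$

$$F_2 \mathrel{\widehat{=}} \begin{pmatrix} 1 & 1 \\ 1 & W_2 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$

• for N = 4: $W_N = W_4 = \sqrt[4]{1} = e^{-j2\pi/4} = e^{-j\pi/2} = -j$

$$F_4 \mathrel{\widehat{=}} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & W_4 & W_4^{2} & W_4^{3} \\ 1 & W_4^{2} & W_4^{2 \cdot 2} & W_4^{2 \cdot 3} \\ 1 & W_4^{3} & W_4^{3 \cdot 2} & W_4^{3 \cdot 3} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{pmatrix}$$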


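DFT: Matrix Factorization ⇝ FFT

• for example N = 1024:

$$F_{1024} \mathrel{\widehat{=}} \begin{pmatrix} I_{512} & D_{512} \\ I_{512} & -D_{512} \end{pmatrix} \cdot \begin{pmatrix} F_{512} & O \\ O & F_{512} \end{pmatrix} \cdot \begin{pmatrix} \text{even} \\ \text{odd} \end{pmatrix}$$

where $I_{512}$ ≙ identity matrix

$D_{512} \mathrel{\widehat{=}} \mathrm{diag}\{ 1, W_{1024}, W_{1024}^{2}, \ldots, W_{1024}^{511} \}$

$F_{512}$ ≙ 512-point Fourier matrix

permutation at end separates even and odd part:

$$(\downarrow)\, x = \left( x[0], x[2], \ldots \right)$$

$$(\downarrow)\,(z)\, x = \left( x[1], x[3], \ldots \right)$$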
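One factorization step can be verified numerically on a small example. The sketch below uses N = 8 instead of 1024 (an assumption to keep the check fast) and compares the factored computation with a direct DFT; the function names are illustrative.

```python
import cmath


def dft(x):
    """Direct N-point DFT, X[k] = sum_n x[n] * W_N^(k*n)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]


def dft_by_factorization(x):
    """One factorization step: permutation, two half-size DFTs, butterflies."""
    N = len(x)
    half = N // 2
    even, odd = x[0::2], x[1::2]       # permutation (even / odd)
    E, O = dft(even), dft(odd)         # block diagonal (F_half, F_half)
    W = cmath.exp(-2j * cmath.pi / N)
    D = [W ** k for k in range(half)]  # diagonal of D_half
    top = [E[k] + D[k] * O[k] for k in range(half)]     # row (I   D)
    bottom = [E[k] - D[k] * O[k] for k in range(half)]  # row (I  -D)
    return top + bottom


x = [float(n) for n in range(8)]
direct = dft(x)
factored = dft_by_factorization(x)
print(max(abs(a - b) for a, b in zip(direct, factored)))
```

The two half-size DFTs are independent of each other and can run in parallel; applying the same step recursively to each half leads to the FFT.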


References